Preface



According to Leo Breiman (1968), probability theory has a right and a left hand. 
The right hand refers to rigorous mathematics, and the left hand refers to ‘proba- 
bilistic thinking’. The combination of these two aspects makes probability theory 
one of the most exciting fields in mathematics. One can study probability as a 
purely mathematical enterprise, but even when you do that, all the concepts that 
arise do have a meaning on the intuitive level. For instance, we have to define what 
we mean exactly by independent events as a mathematical concept, but clearly, 
we all know that when we flip a coin twice, the event that the first gives heads is 
independent of the event that the second gives tails. 

Why have I written this book? I have been teaching probability for more 
than fifteen years now, and decided to do something with this experience. There 
are already many introductory texts about probability, and there had better be a 
good reason to write a new one. I will try to explain my reasons now. 

The typical target that I have in mind is a first year student who wants or 
needs to learn about probability at a stage where he or she has not seen measure 
theory as yet. The usual and classical way to set things up in this first year, is 
to introduce probability as a measure on a sigma-algebra, thereby referring to 
measure theory for the necessary details. This means that the first year student is 
confronted with a theory of probability that he or she cannot possibly understand 
at that point. I am convinced that this is not necessary. 

I do not (of course) claim that one should not base probability theory on 
measure theory later in the study, but I do not really see a reason to do this in the 
first year. One can — as I will show in this book — study discrete and continuous 
random variables perfectly well, and with mathematical precision, within the realm 
of Riemann integration. 

It is not particularly difficult to write rigorously about discrete probability, 
but it is harder to do this more generally. There are several texts available which 
do promise this (no measure theory and rigorous) but I don’t think that any of 
these texts can really claim to have kept its promise. I have achieved precision 
without measure theory, by deviating from the classical route of defining probability
measures on sigma-algebras. In this book, probabilities of events are defined
as soon as a certain (Riemann) integral exists. This is, as I will show, a very natu- 
ral thing to do. As a result, everything in this book can be followed with no more 
background than ordinary calculus. 

As a result of my approach, it will become clear where the limits of this 
approach are, and this in fact forms the perfect motivation to study measure 
theory and probability theory based on measure theory later in the study. Indeed, 
by the end of the book, the student should be dying to learn more about measure 
theory. 
Hence this approach to probability is fully consistent with the way mathematics
works: first there is a theory and you try to see how far this gets you,
and when you see that certain (advanced) things cannot be treated in this theory, 
you have to extend the theory. As such, by reading this book, one not only learns 
a lot about probability, but also about the way mathematics is discovered and 
developed. 

Another reason to write this book is that I think that it is very important that 
when students start to learn about probability, they come to interesting results as 
soon as possible, without being disturbed by unnecessary technical complications. 
Probability theory is one of those fields where you can derive very interesting, 
important and surprising results with a minimum of technical apparatus. Doing 
this first is not only interesting in itself. It also makes clear what the limits are of 
this elementary approach, thereby motivating further developments. For instance, 
the first four chapters will be concerned with discrete probability, which is rich 
enough to say everything about a finite number of coin flips. The very surprising 
arc-sine law for random walks can already be treated at this point. But it will 
become clear that in order to describe an infinite sequence of such flips, one needs 
to introduce more complicated models. The so-called weak law of large numbers 
can be perfectly well stated and proved within the realm of discrete probability, 
but the strong law of large numbers cannot. 

Finally, probability theory is one of the most useful fields in mathematics. As 
such, it is extremely important to point out what exactly we do when we model a 
particular phenomenon with a mathematical model involving uncertainty. When 
can we safely do this, and how do we know that the outcomes of this model do 
actually say something useful? These questions are addressed in the appendix. I 
think that such a chapter is a necessary part of any text on probability, and it 
provides a link between the left and right hand of Leo Breiman. 

A few words about the contents of this book. The first four chapters deal with 
discrete probability, where the possible outcomes of an experiment form a finite or
countable set. The treatment includes an elementary account of random walks,
the weak law of large numbers and a primitive version of the central limit theorem. 
We have also included a number of confusing examples which make clear that it 
is sometimes dangerous to trust your probabilistic intuition. 

After that, in the Intermezzo, we explain why discrete probability is insuffi- 
cient to deal with probabilistic concepts which include infinitely fine operations, 
such as choosing a point on a line segment. 

After the Intermezzo, Chapter 5 deals with these infinitely fine operations, 
a subject which goes under the name continuous probability. In Chapter 6 we 
continue with another important probabilistic concept which cannot be captured
by discrete probability, namely infinitely many repetitions of an operation, like 
infinitely many coin flips. We discuss applications to branching processes, random 
walk and strong laws of large numbers. 



Chapter 7 is devoted to one of the most important stochastic processes, the 
Poisson process, where we shall investigate a very subtle and beautiful interplay 
between discrete and continuous random variables. 

In Chapter 8, we discuss a number of limit theorems based on characteristic 
functions. A full proof of the central limit theorem is available at that point. 
Finally, in Chapter 9, we explore the limitations of the current approach. We will 
outline how we can extend the current theory using measure theory. This final 
chapter provides the link between this book and probability based on measure 
theory. 

Previous versions of this book were read by a number of people, whose 
comments were extremely important. I would very much like to thank Hanneke 
de Boer, Lorna Booth, Karma Dajani, Odo Diekmann, Massimo Franceschetti, 
Richard Gill, Klaas van Harn, Rob Kooijman, Rein Nobel, Corrie Quant, Rahul 
Roy, Jeffrey Steif, Freek Suyver, Aad van der Vaart and Dmitri Znamenski for their 
valuable comments. In particular, I would also like to thank Shimrit Abraham for 
doing a large number of exercises. 

Ronald Meester, Amsterdam, summer 2003 




Chapter 1 

Experiments 



We start our investigations with a number of elementary examples which involve 
probability. These examples lead to a definition of an experiment, which consists
of a space of possible outcomes, together with an assignment of probabilities to 
each of these outcomes. We define and study basic concepts associated with ex- 
periments, including combinatorics, independence, conditional probabilities and a 
first law of large numbers. 



1.1 Definitions and Examples 

When we toss a coin, and have no particular reason to favour heads or tails, 
then anybody will be happy to assign probability 1/2 to each of the two possible
outcomes. Why is this a reasonable choice? Well, in the first place, if the coin is
fair, then heads and tails should receive the same probability. Fine, but why do
we choose probability 1/2 each, and not, say, 1/10? The reason for this has to do
with our intuitive notion of what a probability means. If we toss the coin $k$ times,
and the number of heads among these $k$ tosses is $k_h$, then the relative frequency
of heads is equal to $k_h/k$. Now if $k$ is large, then we tend to think about $k_h/k$
as being close to the probability of heads. The relative frequency of tails can be
written as $k_t/k$, where $k_t$ is the number of tails among the $k$ tosses, and we again
think of $k_t/k$ as being close to the probability of tails. Since

$$\frac{k_h}{k} + \frac{k_t}{k} = 1,$$

we see that, at least intuitively, the probabilities of heads and tails should add up 
to 1. 

In this simple example, the two possible outcomes form the set 



$$\Omega = \{\text{head}, \text{tail}\},$$

and we can formalise the assignment of probabilities by defining a function

$$p : \Omega \to [0, 1]$$

given by

$$p(\text{head}) = p(\text{tail}) = 1/2.$$

More generally, if we want to select a number from the set $\Omega = \{1, 2, \ldots, N\}$
without any preference, then we can model this with a function $p : \Omega \to [0, 1]$
defined by

$$p(i) = \frac{1}{N}, \quad i = 1, \ldots, N,$$

that is, we assign probability $1/N$ to each possible outcome. If someone now asks
about the probability of seeing either a 1 or a 2, we simply add up the probabilities
of 1 and 2 to obtain $2/N$ (assuming of course that $N \geq 2$).

To see that this adding of probabilities is reasonable, imagine that we choose
a number between 1 and $N$, $k$ times. Denoting the number of 1s and 2s among
these $k$ numbers by $k_1$ and $k_2$ respectively, then, as discussed above, the relative
frequency $k_1/k$ should be close to the probability of a 1. Similarly, the relative
frequency of seeing a 1 or a 2 should be close to the probability of seeing a 1 or
a 2. Since the relative frequency of a 1 or a 2 is equal to $(k_1 + k_2)/k$, this leads to
the conclusion that the probability of seeing a 1 or a 2 should indeed be the sum
of the probabilities of 1 and 2.
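
The relative-frequency argument can be tried out numerically. The following is only a sketch: the values N = 6 and k = 10000, the fixed seed and the helper name `relative_frequencies` are illustrative assumptions, not part of the text.

```python
import random
from fractions import Fraction

def relative_frequencies(n_values, k, seed=0):
    """Draw k numbers uniformly from {1, ..., n_values}.

    Returns the relative frequencies of '1', of '2', and of '1 or 2'.
    """
    rng = random.Random(seed)  # fixed seed so that a run is reproducible
    draws = [rng.randint(1, n_values) for _ in range(k)]
    k1 = draws.count(1)
    k2 = draws.count(2)
    k12 = sum(1 for d in draws if d in (1, 2))
    return Fraction(k1, k), Fraction(k2, k), Fraction(k12, k)

f1, f2, f12 = relative_frequencies(n_values=6, k=10_000)
# The frequency of '1 or 2' is exactly the sum of the two separate frequencies.
print(f1 + f2 == f12)
# Each single frequency should be near 1/6, and the combined one near 2/6.
print(float(f1), float(f2), float(f12))
```

The additivity of the frequencies holds exactly for every run; only their closeness to 1/6 and 2/6 depends on k being large.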

In these first two examples, all outcomes have the same probability. It is, 
however, not always appropriate to assign equal probability to all possible out- 
comes. For instance, suppose that we have an urn which contains three red and 
two blue balls, so that balls with the same colour are indistinguishable. When we 
take a ball from the urn without looking, and we are only interested in the colour, 
then the possible outcomes of this experiment are ‘red’ and ‘blue’. The probability 
of ‘red’ should, of course, be 3/5, and the probability of ‘blue’ should be 2/5. 

Another point of concern is that the number of possible outcomes need not 
be finite. For instance, if you would like to make a probabilistic statement about 
people’s favourite number, then it is not appropriate to take a finite sample space, 
since there is no a priori bound on this number. In such cases it is often natural
to take $\Omega = \mathbb{N} = \{0, 1, 2, \ldots\}$.

In the first four chapters of this book, we shall only be concerned with proba- 
bilistic experiments with a finite or countably infinite number of possible outcomes. 
Loosely speaking, an infinite set $S$ is countably infinite if we can list the elements
in $S$ as $S = \{s_1, s_2, \ldots\}$. The formal definition is as follows.

Definition 1.1.1. A set S is said to be countably infinite if there is a one-to-one 
correspondence between N and S. A set which is either finite or countably infinite 
is said to be countable. 



For instance, the set Z of all integers is countable since we can list all elements 
in Z as 



Z = {0,-1, 1,-2, 2,...}. 
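
The listing of the integers above is a one-to-one correspondence between the natural numbers and the integers, and it can be written down explicitly. The function name `nth_integer` below is a hypothetical choice made for this sketch.

```python
def nth_integer(n):
    """Map n = 0, 1, 2, 3, 4, ... to 0, -1, 1, -2, 2, ... as in the listing above."""
    if n % 2 == 0:
        return n // 2          # even positions give 0, 1, 2, ...
    return -((n + 1) // 2)     # odd positions give -1, -2, -3, ...

first_terms = [nth_integer(n) for n in range(7)]
print(first_terms)

# On an initial segment, every integer is hit exactly once.
values = [nth_integer(n) for n in range(1001)]
print(sorted(values) == list(range(-500, 501)))
```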
The reason that we stick to countably many outcomes, at least for the time being, is
that we do not want to be disturbed by purely technical problems at the beginning 
of the development. Most (but not all!) probabilistic concepts are already very 
interesting in this reduced context, and we think that it is better to obtain some 
probabilistic intuition before going into so-called continuous probability, which will 
have to wait until Chapter 5. 

The above examples lead to the following formal definition of an experiment. 

Definition 1.1.2. An experiment is a countable set of outcomes $\Omega$, called the sample
space, together with a function $p : \Omega \to [0, 1]$, with the property that

$$\sum_{\omega \in \Omega} p(\omega) = 1.$$

The function $p$ is called a probability mass function. For $A \subseteq \Omega$, we define the
probability of $A$ by

$$P(A) = \sum_{\omega \in A} p(\omega),$$

where an empty sum is defined to be 0. A subset of $\Omega$ is called an event. The map
$P$ from the collection of subsets of $\Omega$ into $[0, 1]$ is called a probability measure.

Exercise 1.1.3. Show that $P(\Omega) = 1$ and that $P(\emptyset) = 0$.

Example 1.1.4. (Throwing a die) When we throw a die, the obvious choice of the
sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$, and the probability mass function should be
given by $p(i) = 1/6$, $i = 1, \ldots, 6$. The probability of the event $\{2, 4, 6\}$ that the
outcome is even is now easily seen to be

$$P(\{2, 4, 6\}) = p(2) + p(4) + p(6) = \frac{1}{2}.$$

□
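
For a finite sample space, Definition 1.1.2 translates almost literally into code. The sketch below represents the probability mass function as a dictionary and builds the probability measure from it; the name `make_probability_measure` is an assumption of this illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def make_probability_measure(pmf):
    """Given a dict outcome -> p(outcome), return the map A -> P(A)."""
    if sum(pmf.values()) != 1:
        raise ValueError("the probabilities must add up to 1")

    def P(event):
        # Sum p over the outcomes in the event; an empty event gives 0.
        return sum((pmf[w] for w in event), Fraction(0))

    return P

# The die of Example 1.1.4: six outcomes, each with probability 1/6.
die_pmf = {i: Fraction(1, 6) for i in range(1, 7)}
P = make_probability_measure(die_pmf)
print(P({2, 4, 6}))  # probability that the outcome is even
```

Calling `P` on the whole sample space and on the empty set gives a numerical check of Exercise 1.1.3 for this particular experiment, which is of course not a proof.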

Example 1.1.5. Let us look at families with two children. What is the probability 
that there is at least one son? To give an answer to this question, we have to model 
this with an experiment. An appropriate sample space is 

$$\Omega = \{BB, BG, GB, GG\},$$

where for instance the outcome BG corresponds to the eldest child being a son,
and the youngest a girl. It is reasonable (but see the next exercise) to assign equal
probability to all four outcomes. The probability of having at least one son is now
the probability of the event $\{BB, BG, GB\}$ and this probability clearly is 3/4.

□ 



Exercise 1.1.6. What would be your response to someone who takes

$$\Omega = \{BB, BG, GG\},$$

where the outcomes represent two boys, one boy, and no boy respectively, and who 
assigns equal probability to these three outcomes? 
Example 1.1.7. Suppose we want to make probabilistic statements about the number
of sons and daughters in families. A possible model is the following. The sample
space can be taken as $\Omega = \{(i, j) : i, j \in \mathbb{N}\}$, where the outcome $(i, j)$ corresponds
to $i$ sons and $j$ daughters. A possible choice for the probability mass function is

$$p((i, j)) = 2^{-i-j-2}.$$

To see that this indeed is a probability mass function, we write

$$\sum_{i=0}^{\infty} \sum_{j=0}^{\infty} 2^{-i-j-2} = \sum_{i=0}^{\infty} 2^{-i-1} \sum_{j=0}^{\infty} 2^{-j-1} = 1.$$



What is the probability that there are no sons? The event that there are no sons
can be written as $A = \{(0, 0), (0, 1), (0, 2), \ldots\}$ and the probability $P(A)$ of this
event is

$$P(A) = \sum_{j=0}^{\infty} p((0, j)) = \sum_{j=0}^{\infty} 2^{-j-2} = 1/2.$$

□
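
The two infinite sums in this example can be approximated by finite partial sums. The sketch below assumes the probability mass function $p((i, j)) = 2^{-i-j-2}$ of this example; the helper names and the truncation level m are illustrative choices.

```python
from fractions import Fraction

def p(i, j):
    """Probability mass function of the example: p((i, j)) = 2^(-i-j-2)."""
    return Fraction(1, 2 ** (i + j + 2))

def partial_total(m):
    """Sum of p over the finite square 0 <= i, j < m."""
    return sum(p(i, j) for i in range(m) for j in range(m))

def partial_no_sons(m):
    """Sum of p((0, j)) over 0 <= j < m."""
    return sum(p(0, j) for j in range(m))

# The partial sums increase towards 1 and 1/2 respectively as m grows.
for m in (5, 10, 20):
    print(m, float(partial_total(m)), float(partial_no_sons(m)))
```

The partial sums equal $(1 - 2^{-m})^2$ and $\frac{1}{2}(1 - 2^{-m})$ exactly, which makes the limits 1 and 1/2 visible.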

Exercise 1.1.8. Do you think that this is a realistic probability mass function?
Why? 

The probability of the outcome $\omega$ can be written in terms of the probability
mass function as $p(\omega)$ or in terms of the probability measure as $P(\{\omega\})$. The
difference is that $p$ assigns a probability to an element of $\Omega$, whereas $P$ assigns
a probability to a subset of $\Omega$. However, we shall often abuse notation, and write
$P(\omega)$ instead of $P(\{\omega\})$.

A useful way to think about an experiment is to imagine that you ‘pick a
point’ from $\Omega$ in such a way that the probability to pick $\omega$ is just $P(\omega)$. In this
interpretation, we say that $P(A)$ is the probability that $A$ occurs, that is, $P(A)$
should be thought of as the probability that the chosen $\omega$ is a point in $A$.

For an event $A$, we write $A^c$ for the event $\Omega \setminus A$, that is, the event that $A$ does
not occur:

$$A^c = \{\omega : \omega \notin A\}.$$

The event $A^c$ is called the complement of $A$.

For two events $A$ and $B$, we write $A \cap B$ for the event that $A$ and $B$ both
occur, that is,

$$A \cap B = \{\omega : \omega \in A \text{ and } \omega \in B\}.$$

The event $A \cap B$ is called the intersection of $A$ and $B$.

Finally, we write A U B for the event that at least one of the events A or B 
occurs, that is 

$$A \cup B = \{\omega : \omega \in A \text{ or } \omega \in B\}.$$

The event A U B is called the union of A and B. 
We sometimes write $A \setminus B$ for $A \cap B^c$, that is, the event that $A$ occurs, but $B$
does not. When two events $A$ and $B$ have no common element, that is, $A \cap B = \emptyset$,
we say that $A$ and $B$ are disjoint. The collection of events $A_1, A_2, \ldots$ is called
(pairwise) disjoint, if $A_i \cap A_j = \emptyset$ for all $i \neq j$.

Example 1.1.9. Consider Example 1.1.4, and let $A$ be the event that the outcome
is even, and $B$ be the event that the outcome is at most 4. Then $A = \{2, 4, 6\}$,
$B = \{1, 2, 3, 4\}$, $A \cup B = \{1, 2, 3, 4, 6\}$, $A \cap B = \{2, 4\}$ and $A^c = \{1, 3, 5\}$. □



Very often it is the case that all possible outcomes of an experiment have 
the same probability, see for instance Example 1.1.4 and Example 1.1.5 above. In 
such cases we can compute the probability of an event $A$ by counting the number
of elements in $A$, and dividing this by the number of elements in $\Omega$. This is precisely
the content of the following proposition. We denote the number of elements in a 
set A by \A\. 



Proposition 1.1.10. Let $\Omega$ be a finite sample space, and let $P$ be a probability
measure that assigns equal probability to each outcome. Then

$$P(A) = \frac{|A|}{|\Omega|}$$

for all $A \subseteq \Omega$.



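Proof. Since $P(\omega)$ is the same for all $\omega \in \Omega$, and $\Omega$ contains $|\Omega|$ elements, we have
that $P(\omega) = |\Omega|^{-1}$ for all $\omega \in \Omega$. Hence,

$$P(A) = \sum_{\omega \in A} P(\omega) = \sum_{\omega \in A} \frac{1}{|\Omega|} = \frac{|A|}{|\Omega|}.$$

□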



Exercise 1.1.11. Show that in a countably infinite sample space, it is impossible
to assign equal probability to all possible outcomes. (What would happen if this 
common probability were positive? What would happen if this common proba- 
bility were zero?) As a consequence, it is impossible to choose a natural number 
randomly, in such a way that all numbers have the same probability. 

Example 1.1.12. Consider Example 1.1.4 again. The probability of the event $B =
\{1, 2, 3, 4\}$ is now easily seen to be 4/6 = 2/3. Similarly, the probability of $A =
\{2, 4, 6\}$ is simply 3/6 = 1/2. □

The previous example suggests that it is easy to apply Proposition 1.1.10. 
Indeed, in this example, there were six outcomes possible, and the number of 
elements in B was simply four, leading to a probability of 4/6. However, counting 
the number of elements in a set is not always an easy matter, and sometimes 
relies on tricky arguments. We devote a separate section to methods of counting 
in various situations, with applications to experiments where all outcomes are 
equally likely. 
1.2 Counting and Combinatorics

In combinatorics, we distinguish between ordered and unordered sets. In an ordered 
set, the order plays a role, whereas in an unordered set, it does not. For instance, 
the list of all ordered subsets of size 2 of {1,2,3} consists of (1,2), (2,1), (1,3), 
(3, 1), (2,3) and (3,2); the list of unordered subsets of size two consists of {1,2}, 
{1,3} and {2,3}. The set {2,1} is the same as {1,2} and therefore not listed 
separately. Note that we write ordered sets between round brackets, and unordered 
subsets between curly brackets. When we talk about a ‘subset’, without prefix, then 
we mean an unordered subset; when the ordering is important this will always be 
mentioned specifically. 

All theory of this section is contained in the following basic result. After the 
proposition and its proof, we give a number of examples. 

Proposition 1.2.1. (Combinatorial counting methods) 

(a) The number of sequences of length $k$ with $n$ symbols is $n^k$.

(b) The number of ordered subsets of $k$ elements from a set with $n$ elements is
equal to

$$n \times (n - 1) \times \cdots \times (n - k + 1).$$

In particular, there are $n!$ ways to order a set of $n$ elements.

(c) The number of subsets of $k$ elements from a set with $n$ elements, denoted by
$\binom{n}{k}$, is equal to

$$\binom{n}{k} = \frac{n \times (n - 1) \times \cdots \times (n - k + 1)}{k!} = \frac{n!}{k!\,(n - k)!}.$$

This number is called ‘$n$ choose $k$’.

Proof. For (a), note that for the first element of the sequence we have $n$ possibilities.
For the second element, we again have $n$ possibilities, and combined with the
first element, this gives $n^2$ possibilities for the first two elements. Continue now in
the obvious way.

For (b), note that for the first element we have $n$ possibilities. After choosing
the first element, we have only $n - 1$ possibilities for the second, etcetera.

For (c), observe that we ask here for the number of subsets, and then the
order is irrelevant. Each collection of $k$ elements can be obtained in $k!$ different
orderings, so to obtain the number of subsets of size $k$, we need to take the number
of ordered subsets, and divide by $k!$, that is

$$\frac{n \times (n - 1) \times \cdots \times (n - k + 1)}{k!}.$$

□
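
For small $n$ and $k$, the three counts of Proposition 1.2.1 can be checked by listing all objects explicitly. The sketch below uses Python's standard `itertools` and `math` modules; the two function names are hypothetical and chosen only for this illustration.

```python
from itertools import combinations, permutations, product
from math import comb, perm

def count_by_enumeration(n, k):
    """Count sequences, ordered subsets and subsets by listing them all."""
    symbols = range(n)
    sequences = sum(1 for _ in product(symbols, repeat=k))      # part (a)
    ordered_subsets = sum(1 for _ in permutations(symbols, k))  # part (b)
    subsets = sum(1 for _ in combinations(symbols, k))          # part (c)
    return sequences, ordered_subsets, subsets

def count_by_formula(n, k):
    """The same three counts from the closed formulas of the proposition."""
    return n ** k, perm(n, k), comb(n, k)

print(count_by_enumeration(8, 3))
print(count_by_formula(8, 3))
```

Enumeration is only feasible for small values, which is precisely why the formulas are useful.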

Example 1.2.2. (Drawing with replacement) Consider an urn with eight balls, 
numbered 1, . . . , 8. We draw three balls with replacement, that is, after drawing a 
ball, we note its number and put it back into the urn, so that it may be drawn a 
second or even a third time. The sample space $\Omega$ of this experiment consists of all
sequences of length 3, with the symbols 1, . . . , 8. According to Proposition 1.2.1(a),
$\Omega$ has $8^3 = 512$ elements. When we make the assumption that each sequence of
length 3 has the same probability, this leads to the conclusion that any given 
sequence, for instance (4,4,8) has probability 1/512 to occur. □ 

Example 1.2.3. (Drawing an ordered collection without replacement) Consider 
the same urn, with the same balls. We now draw three balls without replacement, 
that is, a chosen ball is not put back in the urn. We note the numbers of the 
chosen balls, in order. The sample space $\Omega'$ corresponding to this experiment is the set
consisting of all sequences of length 3 with the symbols 1, . . . , 8, where each symbol
can appear at most once. The number of elements in $\Omega'$ is the number of ordered
subsets of size 3. According to Proposition 1.2.1(b), this is equal to $8 \times 7 \times 6 = 336$.
The probability to see (3, 7, 1) (in this order) is then 1/336, under the assumption 
that all outcomes have the same probability. □ 

Example 1.2.4. (Drawing an unordered subset) Consider the same urn once more, 
this time choosing three balls simultaneously, so that the order is irrelevant. This 
experiment corresponds to the sample space $\Omega''$ which consists of all subsets of size
3 of a set with eight elements. According to Proposition 1.2.1(c), $\Omega''$ has $\binom{8}{3} = 56$
elements. Under the assumption that all outcomes have the same probability, the
probability to select the set $\{3, 7, 1\}$ is now 1/56. Note that this is six times the
probability of the event in the previous example. The reason for this is that the
set $\{3, 7, 1\}$ can appear in $3! = 6$ different orderings. □

Example 1.2.5. (Master Mind) In the game of Master Mind, there are four gaps 
in a row, which can be filled with little coloured sticks. There are six different 
colours for the sticks. One can ask how many combinations there are to do this. 
This depends on the rules that we agree on. 

If we insist that all four gaps are actually filled with different colours, then
the number of ways to do this is the same as the number of ways to choose an
ordered set of size 4 from a set of six elements, that is, $6 \times 5 \times 4 \times 3 = 360$.

If we allow gaps, that is, if we allow a gap not to be filled, then the number
of possibilities is $7 \times 6 \times 5 \times 4 = 840$.

Finally, if we allow gaps and the same colour to be used more than once,
then the number of possibilities is just the number of ordered sequences of length
4 using seven symbols, that is, $7^4 = 2401$. This last number appears in commercials
for Master Mind as the total number of possible combinations. □
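
The three Master Mind counts can be reproduced by brute force. The sketch below follows the convention implicit in the count 840, namely that an unfilled gap is treated as a seventh symbol which, when repeats are forbidden, may itself be used at most once; the colour labels and the function name `count_codes` are illustrative assumptions.

```python
from itertools import product

COLOURS = ["c1", "c2", "c3", "c4", "c5", "c6"]
GAP = "gap"  # an unfilled position, treated as a seventh symbol

def count_codes(allow_gaps, allow_repeats):
    """Count the ways to fill four positions under the given rules."""
    symbols = COLOURS + [GAP] if allow_gaps else COLOURS
    count = 0
    for code in product(symbols, repeat=4):
        # Without repeats, all four symbols in the code must differ.
        if allow_repeats or len(set(code)) == 4:
            count += 1
    return count

print(count_codes(allow_gaps=False, allow_repeats=False))
print(count_codes(allow_gaps=True, allow_repeats=False))
print(count_codes(allow_gaps=True, allow_repeats=True))
```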

Exercise 1.2.6. How many possibilities are there when gaps are not allowed, but
multiple use of the same colour is? 

An application of Proposition 1.2.1 need not always be completely straight- 
forward. In the following example, we need to use it twice. 
Example 1.2.7. Suppose that we have four math books, five language books and
two art books which we want to put on a shelf. If we put them in a completely 
random order, what is the probability that the books are grouped per subject? 

To solve this problem, we first have to decide on the total number of possi- 
bilities to put the books on the shelf. An application of Proposition 1.2.1(b) tells 
us that this number is 11!. How many ways are there to have the books grouped 
per subject? Well, first of all, there are 3! different orderings of the three subjects.
And within the math books we have 4! orderings, and similarly for the other
subjects. Hence the total number of orderings so that all books are grouped per
subject is 3!4!5!2!. It follows, using Proposition 1.1.10, that the probability that
all books are grouped per subject is equal to

$$\frac{3!\,4!\,5!\,2!}{11!} = 0.000866.$$

□
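
The arithmetic in this example is easy to verify exactly. In the sketch below, the function name `grouped_probability` and its list argument of group sizes are assumptions made for illustration.

```python
from fractions import Fraction
from math import factorial

def grouped_probability(group_sizes):
    """Probability that a random shelf order keeps each subject together."""
    total_books = sum(group_sizes)
    favourable = factorial(len(group_sizes))  # orderings of the subjects
    for size in group_sizes:
        favourable *= factorial(size)         # orderings within a subject
    return Fraction(favourable, factorial(total_books))

p_grouped = grouped_probability([4, 5, 2])
print(p_grouped, float(p_grouped))
```

The exact value is a fraction with numerator 1, which rounds to the decimal given in the text.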

Example 1.2.8. (Fair coin tossing) Suppose we flip a coin $n$ times, where we do
not expect the coin to favour head or tail: we say that the coin is unbiased. For
notational convenience, we shall say that the possible outcomes are 0 and 1, rather
than heads and tails. The sample space $\Omega$ now consists of all sequences of 0s and
1s of length $n$. An element of $\Omega$ is denoted by $\omega = (\omega_1, \omega_2, \ldots, \omega_n)$, where each $\omega_i$
is either 0 or 1.

The total number of such sequences is $2^n$ (Proposition 1.2.1(a)), and since we
assume that the coin is unbiased (that is, does not favour heads or tails), all possible
outcomes should be equally likely, and we therefore define a probability mass
function by $p(\omega) = 2^{-n}$, for all $\omega \in \Omega$. This means that we can apply Proposition
1.1.10, so to compute the probability of an event, we need to count the number 
of elements in this event. This counting method for computing probabilities is not 
always the best thing to do. (In Section 1.4 we shall encounter better and faster 
ways for computing certain probabilities.) Here are some examples of the counting 
method. 

Let $A$ be the set

$$A = \{\omega \in \Omega : \omega_1 = 1\},$$

that is, $A$ is the event that the first coin flip yields a tail. The number of elements
of $A$ is $2^{n-1}$, since the first digit is 1, and the other $n - 1$ digits are unrestricted.
Therefore, $P(A) = 2^{n-1}/2^n = 1/2$, as intuition and common sense require.

A more complicated event is the event $B_k$, defined as the set of outcomes in
which we see $k$ tails and $n - k$ heads. Obviously, we assume that $0 \leq k \leq n$ here.
How do we count the number of elements in the set $B_k$? We need to know how
many ways there are to choose $k$ positions for the 1s in a sequence of length $n$. But
this is the same as choosing a subset of size $k$ from a set of size $n$, and according
to Proposition 1.2.1(c), this is simply $\binom{n}{k}$. Hence,

$$P(B_k) = \binom{n}{k} 2^{-n}.$$

□
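
The counting method of this example can be carried out literally for small $n$: list all $2^n$ sequences and count those with exactly $k$ ones. The sketch below compares this count with $\binom{n}{k} 2^{-n}$; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def probability_k_ones(n, k):
    """Probability of exactly k ones among n fair flips, by direct counting."""
    omega = list(product((0, 1), repeat=n))  # all 2^n equally likely outcomes
    favourable = sum(1 for w in omega if sum(w) == k)
    return Fraction(favourable, len(omega))

def probability_first_is_one(n):
    """Probability that the first flip gives a 1, by direct counting."""
    omega = list(product((0, 1), repeat=n))
    favourable = sum(1 for w in omega if w[0] == 1)
    return Fraction(favourable, len(omega))

n = 10
for k in range(n + 1):
    # Counting and the binomial formula agree for every k.
    print(k, probability_k_ones(n, k), Fraction(comb(n, k), 2 ** n))
```

Listing all outcomes takes time proportional to $2^n$, so this check is only practical for small $n$.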
Example 1.2.9. Suppose that we have four couples who meet for a dinner. They all
sit down at a round table, in a completely random manner. What is the probability 
that no two people of the same sex sit next to each other? To answer this question, 
note that the number of possible configurations is 8!. Indeed, we can order the 8 
chairs around the table in a clockwise manner, say. For the ‘first’ chair, we have 8 
possible people, for the second there are 7 left, etcetera. Among these 8! possible 
outcomes, we need to count how many have no two people of the same sex next to 
each other. The first person can be of any sex, but when this person sits down, in 
the second chair, we have only 4 possibilities left. For the third chair, only 3 people 
are possible, etcetera. We conclude that the number of configurations with no two 
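people of the same sex next to each other is equal to $8 \times 4 \times 3 \times 3 \times 2 \times 2 \times 1 \times 1$.
The probability of the event in question is then, according to Proposition 1.1.10,
equal to

$$\frac{8 \times 4 \times 3 \times 3 \times 2 \times 2 \times 1 \times 1}{8!} = \frac{4!\,3!}{7!} = \binom{7}{4}^{-1} = 0.0286.$$

□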
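
With only $8! = 40320$ seatings, the count in this example can be confirmed by brute force. The sketch below labels the people 0 to 7 and, as an assumption of the illustration, takes the first four labels to be of one sex and the last four of the other; the function name `count_alternating` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def count_alternating(couples=4):
    """Return (good, total): seatings with no two neighbours of the same sex."""
    n = 2 * couples
    sex = [0] * couples + [1] * couples  # person label -> sex
    good = 0
    total = 0
    for seating in permutations(range(n)):
        total += 1
        # Chair c is next to chair c + 1, and the last chair is next to the first.
        if all(sex[seating[c]] != sex[seating[(c + 1) % n]] for c in range(n)):
            good += 1
    return good, total

good, total = count_alternating()
print(good, total, Fraction(good, total))
```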



Example 1.2.10. When you play bridge, a deck of 52 cards is distributed over four 
players, denoted by north, south, west and east. Each player receives 13 cards. 
What is the probability that all spades end up at north and south? There are 
several ways to find the answer. 

Maybe the most natural thing to do is to look at the exact positions of all 
52 cards. We think of each of the four players as having 13 (ordered) positions 
available and we put one card in each position. Since the order is important, there 
are 52! different ways to distribute the cards. How many of these distributions 
lead to a situation in which all spades are either at north or at south? Well, think 
of 52 possible positions for the cards, 13 at each player, and imagine we first find 
positions for all the spades. For the first spade, there are 26 positions available, for 
the second only 25, etcetera. Then, when all the spades have found a position, we 
can distribute the remaining 39 cards as we like, and this can be done in 39! ways. 
Hence the total number of distributions with all spades at either south or north is 
given by $26 \cdot 25 \cdots 15 \cdot 14 \cdot 39!$, and therefore the probability that this happens is

$$\frac{26 \cdot 25 \cdots 15 \cdot 14 \cdot 39!}{52!}.$$

A simple computation shows that this is equal to

$$\frac{26!\,39!}{13!\,52!} = \frac{\binom{26}{13}}{\binom{52}{13}}.$$

When you stare at this last formula for a second, you might see a more direct way
to arrive at this answer. Indeed, rather than looking at the exact positions of all
52 cards, we could look only at the set of positions taken by the 13 spades. There
are $\binom{52}{13}$ such sets. On the other hand, the number of such sets at either north
or south is $\binom{26}{13}$, and this yields the same answer in a much more elegant way. □
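
The two routes to the bridge probability can be compared with exact integer arithmetic. The function names `by_positions` and `by_spade_sets` in the sketch below are illustrative labels for the two arguments of the example.

```python
from fractions import Fraction
from math import comb, factorial

def by_positions():
    """First argument: place the 13 spades among 26 positions, then the rest."""
    spade_placements = factorial(26) // factorial(13)  # 26 * 25 * ... * 14
    return Fraction(spade_placements * factorial(39), factorial(52))

def by_spade_sets():
    """Second argument: only the set of positions taken by the spades matters."""
    return Fraction(comb(26, 13), comb(52, 13))

print(by_positions() == by_spade_sets())
print(float(by_spade_sets()))
```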

The examples show that there is not much theory involved in combinatorics. 
For each new situation, we have to think again how the counting should be done. 
Often there are various approaches leading to the correct answer. The only way 
to get acquainted with combinatorics is to train yourself by doing exercises. 

1.3 Properties of Probability Measures 

In this section we collect and prove a number of useful properties of probability 
measures. Throughout the section, the sample space is denoted by $\Omega$ and $A, B, \ldots$
are events in $\Omega$.

Lemma 1.3.1. (a) For events $A_1, A_2, \ldots$ which are pairwise disjoint, we have

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

(b) $P(A^c) = 1 - P(A)$.

(c) If $A \subseteq B$, then $P(A) \leq P(B)$. More precisely, we have that $P(B) = P(A) +
P(B \setminus A)$.

(d) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Proof. (a) We have

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{\omega \in A_1} P(\omega) + \sum_{\omega \in A_2} P(\omega) + \cdots = \sum_{i=1}^{\infty} P(A_i).$$

(b) Take $A_1 = A$, $A_2 = A^c$ and $A_j = \emptyset$, for all $j \geq 3$. It follows from (a) that
$1 = P(\Omega) = P(A \cup A^c) = P(A) + P(A^c)$, proving (b).

(c) We can write $B = A \cup (B \setminus A)$. This is a union of disjoint events, and the result
now follows from (a).

(d) We can write $A \cup B = A \cup (B \setminus A)$, which is a disjoint union. Hence we find
that

$$P(A \cup B) = P(A) + P(B \setminus A) = P(A) + P(B \setminus (A \cap B)) = P(A) + P(B) - P(A \cap B),$$

where the last equality follows from (c).



□ 
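
Parts (b), (c) and (d) of the lemma can be checked on a concrete experiment. The sketch below uses the fair die of Example 1.1.4 with equally likely outcomes; the particular events chosen are illustrative, and a check on one example is of course no substitute for the proof.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # outcomes of a fair die

def P(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(OMEGA))

A = {2, 4, 6}     # the outcome is even
B = {1, 2, 3, 4}  # the outcome is at most 4
C = {2, 4}        # a subset of B

complement_ok = P(OMEGA - A) == 1 - P(A)                       # part (b)
monotone_ok = P(C) <= P(B) and P(B) == P(C) + P(B - C)         # part (c)
union_ok = P(A | B) == P(A) + P(B) - P(A & B)                  # part (d)
print(complement_ok, monotone_ok, union_ok)
```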
The property proved in (a) is called countable additivity of the probability measure
$P$. It expresses the very intuitive idea that we can add up probabilities of disjoint
events, something we already anticipated in Section 1.1. The property in (b) is
also very natural: for any event $A$, either $A$ or its complement $A^c$ occurs, but
not both, and therefore their probabilities should add up to 1, as they do. The
property in (c) simply states that when you make an event larger, its probability
does not decrease. Finally, (d) can be understood intuitively as follows: if you want to add
all the probabilities of elements in the union of $A$ and $B$, you can first add up
everything in $A$, and then add up everything in $B$. However, by doing that, the
elements that are in both $A$ and $B$ are counted twice, and we need therefore to
subtract the probabilities in the intersection $A \cap B$.

The property in (d) can be generalised to more than two events. For instance, 
for three events we can write 



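$$\begin{aligned}
P(A \cup B \cup C) &= P((A \cup B) \cup C) \\
&= P(A \cup B) + P(C) - P((A \cup B) \cap C) \\
&= P(A) + P(B) - P(A \cap B) + P(C) - P((A \cap C) \cup (B \cap C)) \\
&= P(A) + P(B) - P(A \cap B) + P(C) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).
\end{aligned}$$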



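Exercise 1.3.2. Give an inductive proof of the so-called inclusion-exclusion
formula: for events $A_1, \ldots, A_n$, it is the case that

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap A_2 \cap \cdots \cap A_n).$$

Exercise 1.3.3. Show that for pairwise disjoint events $A_1, \ldots, A_n$, we have

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i).$$

This is called finite additivity of the probability measure.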
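
The additivity rules of this section are easy to check by machine on a small sample space. The following Python sketch is only an illustration and proves nothing: it assumes a fair die with the uniform measure, and the events `A`, `B`, `C` and the helper `prob` are names chosen here for the occasion.

```python
from fractions import Fraction

# Assumed sample space: a fair die, each outcome has probability 1/6.
OMEGA = frozenset(range(1, 7))


def prob(event):
    """Probability of an event (a subset of OMEGA) under the uniform measure."""
    return Fraction(len(event & OMEGA), len(OMEGA))


# Three illustrative events (hypothetical choices, not from the text).
A = {1, 2, 3}      # outcome is at most 3
B = {2, 4, 6}      # outcome is even
C = {3, 4, 5, 6}   # outcome is at least 3

# Property (d) of Lemma 1.3.1 for two events.
lhs_two = prob(A | B)
rhs_two = prob(A) + prob(B) - prob(A & B)

# The three-event version derived in the text.
lhs_three = prob(A | B | C)
rhs_three = (prob(A) + prob(B) + prob(C)
             - prob(A & B) - prob(A & C) - prob(B & C)
             + prob(A & B & C))

print(lhs_two, rhs_two)
print(lhs_three, rhs_three)
```

Exact fractions are used instead of floating point numbers, so that both sides can be compared with `==` without rounding issues.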



1.4 Conditional Probabilities 

When we talk and think about probability, the concept of independence plays a 
crucial role. For instance, when we flip a coin twice, we are inclined to say that the 
outcome of the first flip ‘says nothing’ about the outcome of the second. Somehow, 
we believe that information about the first flip gives us no information about the 
outcome of the second. We believe that the two outcomes are independent of each 
other. 
On the other hand, when we throw a die, and consider the event $E_3$ that
the outcome is equal to 3, and the event $E_{\le 4}$ that the outcome is at most 4, then
information about $E_{\le 4}$ does, in fact, change the probability of $E_3$. Indeed, if I tell
you that $E_{\le 4}$ does not occur, then we know for sure that $E_3$ cannot occur either,
and hence the new probability of $E_3$ had better be 0. If I tell you that $E_{\le 4}$ does
occur, then there are four possibilities left. The new probability that $E_3$ occurs
should therefore be 1/4; see Example 1.4.2 below.

The last argument can be carried out in much more general terms, as follows. 
Suppose I tell you that in a certain sample space $\Omega$, we have two events $A$ and
$B$, with probabilities $P(A)$ and $P(B)$ respectively. This means that a fraction
$P(A)$ of all probability mass is concentrated in the event $A$, and similarly for $B$.
Now suppose that I know that the event $B$ occurs. Does this new information
change the probability of the event $A$? Well, we now know that only outcomes in
$B$ matter, and we can disregard the rest of the sample space. Hence we only need
to look at the probabilities of elements in $B$. The new probability that $A$ occurs
should now be the fraction of probability mass in $B$ that is also in $A$. That is, it
should be the sum of the probabilities of all outcomes in $B \cap A$, divided by the
probability of $B$.

There is an alternative way to arrive at the same conclusion. As observed 
before, we like to interpret probabilities as relative frequencies. Suppose that we 
repeat a certain experiment k times (where k is large), and on each occasion we 
observe whether or not the events $A$ and $B$ occur. The number of occurrences of
an event $E$ is denoted by $k_E$. Conditioning on $B$ means that we only look at those
outcomes for which $B$ occurs, and disregard all other outcomes. In this smaller
collection of trials, the fraction of the outcomes for which $A$ occurs is $k_{A \cap B}/k_B$,
which is equal to

$$\frac{k_{A \cap B}/k}{k_B/k},$$

and this should be close to

$$\frac{P(A \cap B)}{P(B)}.$$



Definition 1.4.1. Suppose $A$ and $B$ are events in a sample space $\Omega$, and suppose
that $P(B) > 0$. The conditional probability of $A$ given $B$ is defined as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$



Example 1.4.2. Suppose we throw a die. What is the conditional probability of
seeing a 3, conditioned on the event that the outcome is at most 4? Well, denoting
the event of seeing a 3 by $E_3$, and the event that the outcome is at most 4 by
$E_{\le 4}$, we have, in the obvious sample space, that $P(E_3) = 1/6$, $P(E_{\le 4}) = 2/3$ and
$P(E_3 \cap E_{\le 4}) = P(E_3) = 1/6$. Hence

$$P(E_3 \mid E_{\le 4}) = \frac{P(E_3 \cap E_{\le 4})}{P(E_{\le 4})} = \frac{1}{4},$$

which makes sense intuitively (why?). □
Example 1.4.3. Consider fair coin tossing, and suppose we flip the coin twice. Let
$A$ be the event that the first coin flip is a 0, and let $B$ be the event that at least one
of the coin flips gives a 0. Then $A = \{(0, 1), (0, 0)\}$ and $B = \{(0, 0), (0, 1), (1, 0)\}$.
We can compute $P(A \mid B)$ as follows:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P((0,0),(0,1))}{P((0,0),(0,1),(1,0))} = \frac{1/2}{3/4} = \frac{2}{3}.$$

□
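
On a finite sample space with equally likely outcomes, Definition 1.4.1 amounts to counting. The following Python sketch recomputes the last two examples under that assumption; the helper names `prob` and `cond_prob` are made up for this illustration.

```python
from fractions import Fraction
from itertools import product


def prob(event, omega):
    """P(event) when all outcomes in omega are equally likely."""
    return Fraction(len(event & omega), len(omega))


def cond_prob(a, b, omega):
    """P(A | B) = P(A and B) / P(B); only defined when P(B) > 0."""
    p_b = prob(b, omega)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b, omega) / p_b


# Example 1.4.2: one throw of a die.
die = set(range(1, 7))
e3 = {3}
e_le4 = {1, 2, 3, 4}
print(cond_prob(e3, e_le4, die))

# Example 1.4.3: two flips of a fair coin.
flips = set(product([0, 1], repeat=2))
first_zero = {w for w in flips if w[0] == 0}
some_zero = {w for w in flips if 0 in w}
print(cond_prob(first_zero, some_zero, flips))
```

The function refuses to condition on an event of probability zero, in line with the requirement $P(B) > 0$ in the definition.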



Conditional probabilities can be quite counterintuitive, as suggested by the fol- 
lowing two examples. The first example also contains an important lesson about 
the choice of the sample space. 



Example 1.4.4. Suppose that we investigate families with two children. We assume
that boys and girls are equally likely, and that the sexes of the children are
independent. The experiment of picking a random family with two children now
corresponds to the sample space $\Omega = \{BB, BG, GB, GG\}$, where for instance $BG$
indicates the outcome that the first child was a boy and the second a girl. The
natural probability measure assigns equal probability to each outcome. The probability
that a family has at least one boy is now simply $P(BB, BG, GB) = 3/4$.
Now suppose that we obtain information that a family we have chosen has at least
one boy. What is the conditional probability that the other child is also a boy?
Well, we are asked about $P(A \mid G)$ where $A$ is the event $\{BB\}$, and $G$ is the event
$\{BB, BG, GB\}$. According to the formulas, we find that $P(A \mid G) = P(A \cap G)/P(G)$,
which is simply $P(BB)/(3/4) = 1/3$. This is already slightly counterintuitive perhaps,
because one might think that the fact that we know that there is at least
one boy says nothing about the sex of the other child. Yet, it does.

Now suppose that we visit a family with two children, and suppose that we
know already that this family has at least one boy. We ring at the door, and a
boy opens the door. This, apparently, is no new information. It seems that the
only thing that happens now is that our previous knowledge about the existence
of at least one boy is confirmed. But this is not true! We have performed a new
experiment, and our original sample space $\Omega = \{BB, BG, GB, GG\}$ no longer
suffices to describe this experiment. Indeed, we cannot learn from $\Omega$ which child
opened the door. What we need to do is enlarge our sample space, perhaps as

$$\Omega' = \{B^*B, BB^*, B^*G, BG^*, G^*B, GB^*, G^*G, GG^*\},$$

where a $*$ refers to the child that opened the door. So for instance $BG^*$ is the event
that the first child is a boy, the second a girl, and that the girl opened the door.
In this new experiment, it is reasonable to assign equal probability to all possible
outcomes again. The new conditional probability that the other child is also a
boy can be computed as follows. We want to know the conditional probability of
$\{B^*B, BB^*\}$ given $\{B^*B, BB^*, B^*G, GB^*\}$ and this is now simply computed as
being equal to 1/2. Hence, the bare fact that a boy opened the door does change
the conditional probability that the other child is also a boy. □

It is important that you realise what happened in this example. In order to 
give an answer to the second problem, we had to change the sample space. The 
first sample space $\Omega$ was not big enough to contain the various events of the second
experiment. 
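
The two experiments of Example 1.4.4 can be enumerated side by side. In the Python sketch below, the encoding of an outcome as a pair of letters, and of the enlarged outcome as a pair together with the index of the child who opened the door, is a choice made for this illustration only.

```python
from fractions import Fraction
from itertools import product

# First experiment: (sex of first child, sex of second child), all equally likely.
omega = set(product("BG", repeat=2))
at_least_one_boy = {w for w in omega if "B" in w}
two_boys = {("B", "B")}
p_first = Fraction(len(two_boys & at_least_one_boy), len(at_least_one_boy))

# Second experiment: also record which child (index 0 or 1) opened the door.
# The eight enlarged outcomes are again taken to be equally likely.
omega_door = {(w, opener) for w in omega for opener in (0, 1)}
boy_opens = {(w, opener) for (w, opener) in omega_door if w[opener] == "B"}
two_boys_door = {(w, opener) for (w, opener) in omega_door if w == ("B", "B")}
p_second = Fraction(len(two_boys_door & boy_opens), len(boy_opens))

print(p_first)   # conditioning on "at least one boy"
print(p_second)  # conditioning on "a boy opened the door"
```

The only difference between the two computations is the sample space, which is exactly the point of the example.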

Example 1.4.5. Consider the following situation. There are three people, A, B and
C, in a room, and each of these three people gets a hat on their head which is
either red or blue, in such a way that all eight possible ways to do this have the
same probability 1/8. Everybody can see the hats of the other two people, but they
cannot see their own hat. The people in the room are not allowed to talk to each
other.

We view A, B and C as a team and we ask the team to select at least one
of them who has to guess the colour of his own hat. Before the team enters
the room, and before they receive their hats, they can agree on a certain
strategy. The question is whether or not the team can come up with a strategy
which yields only correct guesses with probability larger than 1/2. So if the team 
makes one guess, this guess should be correct; if the team (for whatever reason) 
decides to select two or three persons to guess their colour, all guesses must be 
correct for the team to win. 

Note that a strategy with success probability 1/2 is easy to find: the team 
can decide that A will always be the one who will make a guess, and then A 
simply flips a coin to decide between red and blue. The probability that A guesses 
correctly is then of course 1/2. In fact, it is hard to imagine that one can improve 
on this, since whatever strategy is used, someone has to make a guess, and once it 
has been decided that B, say, makes a guess, B, not knowing his own colour, will 
make the wrong guess with probability 1/2. 

Convincing as this may sound, it is not true. In fact, there is a strategy that 
yields the correct colour with probability as big as 3/4. Let me first describe the 
strategy, and then explain why the above reasoning is wrong. 

The team can decide on the following strategy: each of the three people A, B
and C does the following. They look at the hats of the other two people. If these
two hats have the same colour, then he (or she) makes a guess and guesses
the other colour; otherwise he (or she) does not guess. Why does this work? Suppose that not all hats have the same
colour, for instance A is red, B is red, and C is blue. In such a case only C makes
a guess, and his guess will be correct. So unless all hats have the same colour,
this strategy gives the right answer. If all hats are red, say, then all three people
make a guess, and they will all be wrong, and hence the team loses. But when the
colours are any of RRB, RBR, RBB, BRB, BBR or BRR (in the order of A, B and
C), the team wins with this strategy. Hence the probability to win is 6/8 = 3/4.
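
If you do not trust the count, the strategy can be checked by brute force over the eight hat configurations. The following Python sketch does this; the function names `guesses` and `team_wins` and the encoding of colours as the letters "R" and "B" are assumptions of the sketch.

```python
from fractions import Fraction
from itertools import product


def guesses(hats):
    """Return {person index: guessed colour} under the strategy in the text.

    A person guesses only when the two hats he or she sees have the same
    colour, and then guesses the opposite colour.
    """
    result = {}
    for i in range(3):
        others = [hats[j] for j in range(3) if j != i]
        if others[0] == others[1]:
            result[i] = "B" if others[0] == "R" else "R"
    return result


def team_wins(hats):
    """The team wins if at least one guess is made and every guess is correct."""
    made = guesses(hats)
    return len(made) > 0 and all(hats[i] == g for i, g in made.items())


configurations = list(product("RB", repeat=3))
wins = sum(team_wins(h) for h in configurations)
win_probability = Fraction(wins, len(configurations))
print(win_probability)
```

Note that among three hats at least two always share a colour, so somebody always makes a guess.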

All right, so we have this strategy with success probability 3/4. But it is still 
not so clear what was wrong with the original reasoning which seemed to tell us 
that one can never get a success probability bigger than 1/2. The key to understanding
this paradox lies in conditional probabilities. In the reasoning
above, we first assumed that A makes a guess, and given this fact, we claimed that
the success probability cannot be bigger than 1/2. Well, this is correct, even for
the present strategy, since the probability that A guesses correctly, given that A
makes a guess, is equal to

$$\frac{P(\text{A makes a guess and guesses correctly})}{P(\text{A makes a guess})} = \frac{P(BRR, RBB)}{P(BRR, RBB, BBB, RRR)} = \frac{1}{2}.$$

□

Conditional probabilities are very useful for computing unconditional probabilities 
of events, by using the following result. 

Definition 1.4.6. We call a countable collection of events $B_1, B_2, \ldots$ a partition of
$\Omega$ if $B_1, B_2, \ldots$ are pairwise disjoint and satisfy

$$\bigcup_{i} B_i = \Omega.$$

Theorem 1.4.7. Let $B_1, B_2, \ldots$ be a partition of $\Omega$ such that $P(B_i) > 0$ for all $i$,
and let $A$ be any event. Then

$$P(A) = \sum_{i} P(A \mid B_i) P(B_i).$$

Proof. We can write $A$ as a disjoint union,

$$A = (A \cap B_1) \cup (A \cap B_2) \cup \cdots.$$

It then follows from Lemma 1.3.1 that

$$P(A) = \sum_{i} P(A \cap B_i) = \sum_{i} P(A \mid B_i) P(B_i).$$

□



From the definition, we have that $P(A \cap B) = P(A) P(B \mid A)$, whenever $P(A) > 0$.
In the following exercise you are asked to extend this result.

Exercise 1.4.8. Let $A_1, A_2, \ldots, A_n$ be events such that $P(A_1 \cap \cdots \cap A_{n-1}) > 0$.
Prove that

$$P\left(\bigcap_{i=1}^{n} A_i\right) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$

Example 1.4.9. Suppose that we have two urns, urn I containing two white balls
and three blue balls, and urn II containing three blue balls. We pick a random
ball from urn I, put it into urn II and after that pick a random ball from urn II.
What is the probability that the second ball is blue? To answer this question, we
can define the corresponding sample space $\Omega$ as

$$\Omega = \{bb, bw, wb, ww\},$$

where $bb$ refers to the outcome of two blue balls, etcetera. How do we assign
probabilities to the outcomes in $\Omega$? At this point, it turns out that in order to
assign probabilities to events, we in fact will need Theorem 1.4.7. Let $A$ be the event
that the final ball is blue, that is, $A = \{bb, wb\}$, and let $B$ be the event that the
first ball is blue, that is, $B = \{bb, bw\}$. Clearly, we want $P(B) = 3/5$. Also, when
the first ball is blue, this blue ball is put into urn II which after that contains four
blue balls. This means that the second ball must necessarily be blue, and we find
that it should be the case that

$$P(A \mid B) = 1.$$

Similarly, we should have

$$P(A \mid B^c) = \frac{3}{4},$$

since if the first ball is white, then urn II contains one white and three blue balls.
Hence, Theorem 1.4.7 tells us that we have

$$P(A) = P(A \mid B) P(B) + P(A \mid B^c) P(B^c) = 1 \cdot \frac{3}{5} + \frac{3}{4} \cdot \frac{2}{5} = \frac{9}{10},$$

which is the answer to the original question. We can now also assign probabilities
to all outcomes. Indeed, we have

$$P(bb) = P(A \cap B) = P(A \mid B) P(B) = \frac{3}{5},$$

and similarly for the other outcomes. □
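
The use of Theorem 1.4.7 in this example is a one-line sum, which the following Python sketch spells out. The function `total_probability` and the variable names are hypothetical; the numbers are the ones from the urn example.

```python
from fractions import Fraction


def total_probability(cond_probs, partition_probs):
    """P(A) = sum over i of P(A | B_i) P(B_i), for a partition B_1, B_2, ...

    cond_probs[i] plays the role of P(A | B_i), partition_probs[i] of P(B_i).
    """
    if sum(partition_probs) != 1:
        raise ValueError("the probabilities of a partition must sum to 1")
    return sum(c * b for c, b in zip(cond_probs, partition_probs))


# Partition: first ball blue / first ball white (urn I: 3 blue, 2 white).
p_first_blue = Fraction(3, 5)
p_first_white = 1 - p_first_blue

# Second ball blue, given the colour of the ball moved into urn II.
p_blue_given_first_blue = Fraction(4, 4)   # urn II then holds 4 blue balls
p_blue_given_first_white = Fraction(3, 4)  # urn II then holds 3 blue, 1 white

p_second_blue = total_probability(
    [p_blue_given_first_blue, p_blue_given_first_white],
    [p_first_blue, p_first_white],
)
print(p_second_blue)
```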

Exercise 1.4.10. Complete the assignment of probabilities in the last example
by calculating $P(bw)$, $P(wb)$ and $P(ww)$. Check that they sum up to 1.

The last example contains an important lesson. In this example, we want 
something to be true for certain conditional probabilities, in this case for instance 
the probability that the second ball is blue, given that the first ball was blue. 
We were able to compute probabilities of events without first computing the full 
probability measure on $\Omega$, simply by using the rules of Theorem 1.4.7. This is a
very common situation. In many probabilistic problems, there is no need to worry 
about the sample space or the probability measure, and we can compute probabil- 
ities as in the last example. If you want, you can always, after that, compute the 
full probability measure, but we will not always do that in this book. The follow- 
ing example further illustrates this idea. The example is classical and potentially 
confusing. 
Example 1.4.11. Suppose we have a population of people. Suppose in addition that
the probability that an individual has a certain disease is 1/100. There is a test 
for this disease, and this test is 90% accurate. This means that the probability 
that a sick person is tested positive is 0.9, and that a healthy person is tested 
positive with probability 0.1. One particular individual is tested positive. Perhaps 
this individual is inclined to think the following: ‘I have been tested positive by a 
test which is accurate 90% of the time, so the probability that I have the disease 
is 0.9. ’ However, this is not correct, as the following computation shows. 

Let $A$ be the event that this individual has the disease, and let $B$ be the
event that the test is positive. The individual is interested in the conditional
probability $P(A \mid B)$. The assumptions tell us that given that a person is sick, the
test is positive with probability 0.9. Hence the assumptions tell us that $P(B \mid A) = 0.9$,
$P(B \mid A^c) = 0.1$ and that $P(A) = 0.01$. To compute $P(A \mid B)$ we proceed as
follows:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A) P(A)}{P(B \mid A) P(A) + P(B \mid A^c) P(A^c)} = \frac{0.9 \cdot 0.01}{0.9 \cdot 0.01 + 0.1 \cdot 0.99} = \frac{1}{12}.$$

Hence the probability that a person who tests positive actually has the
disease is equal to 1/12, which is approximately 0.083. This seems quite absurd, but the underlying reason is not
hard to grasp. Since the disease is rare, there are many more people without the 
disease than with the disease. Therefore, it is much more likely that an individual 
is healthy and is tested positive, than that he or she is sick and the test is correct. 

□ 
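
The computation in the last example can be sketched in a few lines of Python. The function `posterior` is a hypothetical helper, written for a partition into finitely many events; it takes the probabilities of the partition events and the conditional probabilities of the observed event given each of them.

```python
from fractions import Fraction


def posterior(prior, likelihoods):
    """Return the list of P(B_i | A), given P(B_i) and P(A | B_i)."""
    joint = [p * l for p, l in zip(prior, likelihoods)]  # P(A and B_i)
    evidence = sum(joint)                                # P(A), by Theorem 1.4.7
    if evidence == 0:
        raise ValueError("P(A) must be positive")
    return [j / evidence for j in joint]


# Partition: the individual is sick / healthy.
prior = [Fraction(1, 100), Fraction(99, 100)]
# Probability of a positive test in each of the two cases.
likelihood_positive = [Fraction(9, 10), Fraction(1, 10)]

p_sick_given_positive = posterior(prior, likelihood_positive)[0]
print(p_sick_given_positive, float(p_sick_given_positive))
```

It is instructive to change the prior from 1/100 to 1/2 and see how strongly the answer depends on how rare the disease is.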



The computation in the last example is an example of the following general 
result. 



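Theorem 1.4.12. (Bayes’ rule) Let $B_1, B_2, \ldots, B_n$ be a partition of $\Omega$ such that
$P(B_i) > 0$ for all $i$, and let $A$ be any event with $P(A) > 0$. Then for all $i$,

$$P(B_i \mid A) = \frac{P(A \mid B_i) P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j) P(B_j)}.$$

Proof. We write

$$P(B_i \mid A) = \frac{P(A \cap B_i)}{P(A)} = \frac{P(A \mid B_i) P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j) P(B_j)},$$

according to the definition and Theorem 1.4.7. □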



This rule can be used in a situation where we know the probabilities of the
$B_i$’s, and then learn something about the occurrence of an event $A$. Does this new
information change the probabilities of the $B_i$’s? In other words, is $P(B_i \mid A)$ different
from $P(B_i)$? Bayes’ rule gives a way to compute $P(B_i \mid A)$. See the forthcoming
Example 1.5.14 for another interesting example of applying Bayes’ rule.



1.5 Independence 

Now that we have defined the notion of conditional probability, we can define 
what we mean by independent events. Clearly, for two events A and B to be 
independent, the probability of A should not change when we learn that B occurs. 
Therefore, we could define $A$ and $B$ to be independent whenever $P(A) = P(A \mid B)$.
However, this definition would not be completely satisfactory, for two reasons.
First, $P(A \mid B)$ would not be defined when $P(B) = 0$, and second, the definition
would not be symmetric in A and B. Hence we define independence as follows. 

Definition 1.5.1. Two events $A$ and $B$ are said to be (mutually) independent whenever

$$P(A \cap B) = P(A) P(B).$$

More generally, the events $A_1, A_2, \ldots, A_n$ are called independent if

$$P\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} P(A_j)$$

for all index sets $J \subseteq \{1, 2, \ldots, n\}$.

Note that if $P(B) > 0$, $A$ and $B$ are independent if and only if $P(A \mid B) = P(A)$.

Exercise 1.5.2. Suppose that $P(B) = 0$ or 1. Show that $B$ is independent of $A$,
for any event $A$.

Exercise 1.5.3. Suppose that $A$ and $B$ are independent. Show that $A^c$ and $B$
are also independent. Show that $A^c$ and $B^c$ are independent. Show that $A$ and $B^c$
are independent.

The following example explains why in the definition of independence for
more than two events, we need to require $P\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} P(A_j)$ for all index
sets $J$, and not only for such sets of size 2.

Example 1.5.4. Consider the sample space 

$$\Omega = \{123, 132, 111, 213, 231, 222, 312, 321, 333\},$$

and define a probability measure by assigning probability 1/9 to each outcome.
Define the event $A_k$ as the event that the $k$th digit is a 1, for $k = 1$, 2 or 3. It is easy
to check that $P(A_k) = 1/3$, and $P(A_1 \cap A_2) = P(A_3 \cap A_2) = P(A_1 \cap A_3) = 1/9$.
Hence, $A_1$ and $A_2$ are independent, and the same is true for the other pairs.
However, $P(A_1 \cap A_2 \cap A_3) = 1/9$ and this is not equal to $P(A_1) P(A_2) P(A_3) = 1/27$. This
means that the collection of events $A_1, A_2, A_3$ is not independent. Since every
pair $A_i, A_j$ is independent, we call the family of events pairwise independent.
This example shows that a family can be pairwise independent without being
independent. □
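
The claims of Example 1.5.4 can be verified mechanically. The Python sketch below checks the product rule of Definition 1.5.1 for every index set of size 2 and 3; the helper `factorises` and the 0-based indexing of the events are conventions of the sketch, not of the text.

```python
from fractions import Fraction
from itertools import combinations

# The nine equally likely outcomes of Example 1.5.4.
OMEGA = ["123", "132", "111", "213", "231", "222", "312", "321", "333"]


def prob(event):
    """Probability of a set of outcomes under the uniform measure on OMEGA."""
    return Fraction(len(event), len(OMEGA))


# A[k] is the event that digit number k (0-based) equals 1.
A = [frozenset(w for w in OMEGA if w[k] == "1") for k in range(3)]


def factorises(indices):
    """Check P(intersection of the A_j) == product of the P(A_j), j in indices."""
    intersection = frozenset(OMEGA)
    product_of_probs = Fraction(1)
    for j in indices:
        intersection &= A[j]
        product_of_probs *= prob(A[j])
    return prob(intersection) == product_of_probs


pairwise_independent = all(factorises(pair) for pair in combinations(range(3), 2))
independent = all(
    factorises(subset)
    for size in (2, 3)
    for subset in combinations(range(3), size)
)
print(pairwise_independent, independent)
```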

Example 1.5.5. This example is classical and bound to surprise you. Suppose we 
have r people in a room. We assume that their birthdays are equally likely to 
be any day of the year (which we assume to have 365 days, ignoring leap years). 
Furthermore, we assume that the events that the birthday of a particular person is 
on a particular day, are independent. What is the probability that no two persons 
in the room have the same birthday? 

We can order the persons from 1 to $r$, and convince ourselves, using Proposition 1.2.1,
that there are $365^r$ possible collections of birthdays. Now let the
event that the birthday of person $i$ is day $y$ be denoted by $E(i, y)$, for $i = 1, \ldots, r$
and $y = 1, \ldots, 365$. By independence, the probability of any outcome $(y_1, \ldots, y_r)$
is equal to the product of the individual probabilities:

$$P\left(\bigcap_{i=1}^{r} E(i, y_i)\right) = \prod_{i=1}^{r} P(E(i, y_i)) = \left(\frac{1}{365}\right)^{r}.$$

This means that all possible outcomes of this experiment have the same probability
and Proposition 1.1.10 applies. Hence we only need to count the number
of outcomes in which no two birthdays coincide. How many outcomes have no
common birthday? Well, there is no restriction on the first, but when we know
the birthday of person 1, we have only 364 possibilities for the second, etcetera.
Hence we conclude that the probability of having no common birthday is equal to

$$\frac{365 \cdot 364 \cdots (365 - r + 1)}{365^r}.$$

Now you can check that this probability is less than 1/2 for r = 23. The very 
surprising conclusion is that in a collection of 23 people, the probability that at 
least two of them have the same birthday is larger than 1/2. □ 
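
The check for $r = 23$ is tedious by hand but immediate by machine. The following Python sketch evaluates the product from the example exactly; the function name `prob_no_shared_birthday` and the parameter `days` are hypothetical.

```python
from fractions import Fraction


def prob_no_shared_birthday(r, days=365):
    """Probability that r people all have different birthdays.

    This is the product (days/days) * ((days-1)/days) * ... * ((days-r+1)/days),
    under the equal-likelihood and independence assumptions of the example.
    """
    prob = Fraction(1)
    for i in range(r):
        prob *= Fraction(days - i, days)
    return prob


for r in (10, 22, 23, 50):
    print(r, float(prob_no_shared_birthday(r)))
```

Running this for a few values of $r$ shows that 23 is the smallest number of people for which the probability of no common birthday drops below 1/2.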

Exercise 1.5.6. A number of years ago, a mathematician explained the above
example on television in a talk show. The host of that particular show did not
believe him. He looked around at the audience, about 500 people, and asked: 
‘My birthday is on the 23rd of November. Is there anyone else who has the same 
birthday?’ No one answered. The mathematician replied that this was very well 
possible, since he had only made a probabilistic statement. This answer, although 
not completely wrong, was very misleading. Do you see the correct answer to the 
point made by the host of the show? 
Exercise 1.5.7. Before reading the next example, it is a good idea to try to
prove the fact that for all real numbers $a$ and $b$, and all $n \in \mathbb{N}$, we have

$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}.$$






Example 1.5.8. (General coin tossing) We generalise Example 1.2.8 by making
the probability of a 1 arbitrary, say $p$. Hence each coin flip results in a 1 with
probability $p$, and in a 0 with probability $1 - p$. Clearly, we want the outcomes
of different flips to be independent. Hence the event that the first two flips
both result in 1 should have probability $p^2$. This reasoning leads to the conclusion
that any outcome with $k$ 1s and $n - k$ 0s should have probability $p^k (1 - p)^{n-k}$.
Does this make sense? I mean, is $P$ thus defined indeed a probability measure? To
check this, we compute

$$\sum_{\omega \in \Omega} P(\omega) = \sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k} = (p + (1 - p))^n = 1,$$

since there are $\binom{n}{k}$ outcomes with exactly $k$ 1s, and the second identity follows
from Exercise 1.5.7.

What is the probability that the first 1 appears at the kth flip of the coin? We 
can compute this probability using the combinatorial ideas from Example 1.2.8. 
The event in question can be written as 

$$A_k = \{\omega \in \Omega : \omega_1 = \cdots = \omega_{k-1} = 0, \omega_k = 1\}.$$

We can rewrite this in more familiar terms by defining the event $B_i$ as the event
that $\omega_i = 0$. With this definition, we can write

$$A_k = B_1 \cap B_2 \cap \cdots \cap B_{k-1} \cap B_k^c.$$

To compute the probability of $A_k$, we need to distinguish between members of $A_k$
according to the number of 1s among the last $n - k$ positions. There are $\binom{n-k}{i}$
ways to put $i$ 1s among the last $n - k$ positions, and any element in $A_k$ that has
$i$ 1s among the last $n - k$ positions has probability $p(1 - p)^{k-1} \times p^i (1 - p)^{n-k-i}$.
Therefore, we find that

$$P(A_k) = p(1 - p)^{k-1} \sum_{i=0}^{n-k} \binom{n-k}{i} p^i (1 - p)^{n-k-i} = p(1 - p)^{k-1},$$

according again to Exercise 1.5.7.
Figure 1.1: The first picture shows a network (solid lines) with its dual network
(dashed lines). The second picture shows a realisation of a random network, together
with the associated realisation in its dual. Note that in the solid line network,
there is a connection from left to right, while there is no top to bottom
connection in the dual network.



However, there is a better way to compute $P(A_k)$. Note that we have constructed
the experiment in such a way that the events $B_i$ are independent. Indeed,
we built our probability measure in such a way that any outcome with $k$ 1s and
$n - k$ 0s has probability $p^k (1 - p)^{n-k}$, which is the product of the individual
probabilities. Hence we see that

$$P(A_k) = P(B_1 \cap B_2 \cap \cdots \cap B_{k-1} \cap B_k^c) = P(B_1) P(B_2) \cdots P(B_{k-1}) P(B_k^c) = (1 - p)^{k-1} p.$$
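
For small $n$, the probability that the first 1 appears at the $k$th flip can also be obtained by summing the probabilities of all outcomes in the event, which gives an independent check on the formula $p(1-p)^{k-1}$. The Python sketch below does this; `outcome_prob` and `prob_first_one_at` are names invented for the illustration, and the values of $n$ and $p$ are arbitrary.

```python
from fractions import Fraction
from itertools import product


def outcome_prob(omega, p):
    """Probability p^k (1-p)^(n-k) of an outcome with k ones among n flips."""
    ones = sum(omega)
    return p ** ones * (1 - p) ** (len(omega) - ones)


def prob_first_one_at(k, n, p):
    """Sum the probabilities of all outcomes whose first 1 is at flip k (1-based)."""
    total = Fraction(0)
    for omega in product([0, 1], repeat=n):
        if all(w == 0 for w in omega[:k - 1]) and omega[k - 1] == 1:
            total += outcome_prob(omega, p)
    return total


n, p = 6, Fraction(1, 3)
for k in range(1, n + 1):
    print(k, prob_first_one_at(k, n, p), p * (1 - p) ** (k - 1))
```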

Example 1.5.9. (Random networks) The theory that we discussed so far can some- 
times be used in a very surprising way. In this example we consider random net- 
works. Consider an (n + 1) x n rectangle, subdivided into n(n + 1) squares. See 
the solid line network in the first picture in Figure 1.1, where n = 3. Suppose now 
that for each side of a square, we flip a fair coin. When a head comes up, we delete 
the line segment, and when a tail comes up we leave it there. We do not flip coins 
for vertical edges on the boundary of the rectangle, that is why these are drawn 
with a different thickness. What results is a so-called random network and we can 
explore its properties. For instance, does the resulting network allow for a con- 
nection from the left side of the rectangle to the right side? (Now you understand 
why we do not flip coins for the vertical edges on the boundary; for the existence 
or non-existence of a connection from left to right, these edges do not matter.) It 
turns out that the probability of the event that such a connection exists does not 
depend on n. This is quite remarkable. Even more remarkable is that we can prove 
this with the little theory we have had so far. In order to state the result, we
introduce some notation. As usual, we denote the sample space by $\Omega$. An element of
$\Omega$ can be written as $\omega = (\omega(e_1), \omega(e_2), \ldots, \omega(e_k))$, where the $e_i$’s denote the edges
of the network, and $\omega(e_i)$ is equal to 1 when the edge is present in $\omega$, and equal
to 0 otherwise. The states of different edges are independent, that is, the events
$\{\omega(e_i) = 1\}$ form an independent collection of events. Since each edge has
probability 1/2 of being present, each possible outcome has the same probability.







We denote this probability measure by $P_n$. The event that the resulting network
has a connection from left to right is denoted by $LR(n)$.

Theorem 1.5.10. It is the case that

$$P_n(LR(n)) = \frac{1}{2}$$

for all $n$.

Proof. Since every outcome has the same probability, it is enough to show that the 
number of outcomes in which there is a connection from left to right, is the same as 
the number of outcomes for which there is no such connection. We will show this 
by constructing a one-to-one correspondence between outcomes with and without 
such a connection. To do this we introduce the notion of a dual network. 

To each network of size $(n + 1) \times n$ we associate a so-called dual network. This
is drawn in dashed lines in Figure 1.1, and obtained as follows. As is clear from the
picture, a dual network is rectangular of size $n \times (n + 1)$. Now consider a certain
outcome of the experiment, for instance the outcome in the second picture in 
Figure 1.1. With each realisation of the random network, we associate a realisation 
in the dual network as follows. Each edge of the network crosses exactly one edge 
of the dual network. Whenever an edge e of the original network is not present in 
the outcome, the corresponding edge in the dual network - that is, the edge which 
crosses e - is taken out. Otherwise, the edge remains. Now note a most remarkable 
property. If the original network has a connection from left to right, then the dual 
network has no connection from top to bottom. If, on the other hand, the original 
network has no left-right connection, then there is a top to bottom connection in 
the dual. 

With this last observation, our one-to-one correspondence becomes almost 
obvious: to each outcome of the original network which has a left to right con- 
nection, we associate the corresponding outcome in the dual and then rotate it 
by 90 degrees. This gives the desired one-to-one correspondence, and finishes the 
proof. □ 
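
For very small $n$ the theorem can be confirmed by listing all outcomes. The Python sketch below rests on an assumed coordinate model that is not in the text: vertices are the points $(x, y)$ with $0 \le x \le n + 1$ and $0 \le y \le n$, every horizontal unit edge is random, and only the vertical unit edges with $1 \le x \le n$ are random, since the boundary ones do not matter for a left to right connection. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def edges(n):
    """Random edges of the (n+1) x n network in the assumed coordinate model."""
    horizontal = [((x, y), (x + 1, y)) for x in range(n + 1) for y in range(n + 1)]
    vertical = [((x, y), (x, y + 1)) for x in range(1, n + 1) for y in range(n)]
    return horizontal + vertical


def has_left_right_connection(n, present):
    """Search from all left-side vertices; succeed on reaching x = n + 1."""
    adjacency = {}
    for u, v in present:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    stack = [(0, y) for y in range(n + 1)]
    seen = set(stack)
    while stack:
        u = stack.pop()
        if u[0] == n + 1:
            return True
        for v in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def crossing_probability(n):
    """Exact probability of LR(n), by enumerating all equally likely outcomes."""
    all_edges = edges(n)
    favourable = 0
    for state in product([0, 1], repeat=len(all_edges)):
        present = [e for e, s in zip(all_edges, state) if s == 1]
        if has_left_right_connection(n, present):
            favourable += 1
    return Fraction(favourable, 2 ** len(all_edges))


for n in (1, 2):
    print(n, crossing_probability(n))
```

The number of outcomes is $2^{(n+1)^2 + n^2}$ in this model, so this brute-force check is only feasible for $n = 1$ and $n = 2$; the duality argument in the proof is what handles all $n$ at once.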

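Example 1.5.11. In this example, we introduce a very important probability mass
function on the sample space $\mathbb{N}$. We let, for any $\lambda > 0$, $P_\lambda$ be the probability mass
function defined by

$$P_\lambda(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad (1.1)$$

for $k = 0, 1, \ldots$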

Exercise 1.5.12. Prove that this is indeed a probability measure. You will need
the exponential series for this:

$$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}.$$
We shall now explain why this probability mass function is introduced in a
section devoted to independence. You should interpret this explanation as intuitive 
motivation for the probability mass function in (1.1). 

Suppose that we want to investigate arrivals of customers at a shop between 
time t = 0 and t = 1. As a first approximation, we could divide the unit time 
interval into n disjoint intervals of length 1/n, and make the assumption that in 
each time interval of length 1/n at most one customer can arrive. Also, we make 
the assumption that the probability that a customer does in fact arrive in a given 
interval is proportional to the length of that interval. Hence, we assume that there 
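is $\lambda > 0$ such that the probability that a customer arrives in an interval of length
$1/n$ is equal to $\lambda/n$. Finally, we assume that arrivals of customers in different time
intervals are independent. With these assumptions we see that the probability that
exactly $k$ customers arrive in the interval from $t = 0$ to $t = 1$ should be the same
as the probability that $k$ heads come up when we throw $n$ times with a coin which
has probability $\lambda/n$ of being a head. That is, we have that this probability should
be equal to

$$f_n(k) = \frac{n!}{k!(n-k)!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{\lambda^k}{k!} \left(1 - \frac{\lambda}{n}\right)^n \frac{n!}{n^k (n-k)!} \left(1 - \frac{\lambda}{n}\right)^{-k}.$$

This converges to

$$\frac{\lambda^k}{k!} e^{-\lambda}$$

when $n \to \infty$, using that

$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda},$$

one of the standard limits from calculus. □

Exercise 1.5.13. Show that $f_n(k)$ does indeed converge to $\frac{\lambda^k}{k!} e^{-\lambda}$.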
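
A numerical experiment makes the convergence plausible, although it is of course no substitute for a proof. In the Python sketch below, the function names and the particular values $\lambda = 2$ and $k = 3$ are arbitrary choices made for illustration.

```python
import math


def binomial_pmf(k, n, lam):
    """Probability of k arrivals in n slots, each with arrival probability lam/n."""
    q = lam / n
    return math.comb(n, k) * q ** k * (1 - q) ** (n - k)


def poisson_pmf(k, lam):
    """The limiting probability e^(-lam) lam^k / k! from (1.1)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam, k = 2.0, 3
for n in (10, 100, 1000, 10000):
    print(n, binomial_pmf(k, n, lam), poisson_pmf(k, lam))
```

The printed binomial probabilities approach the limiting value as $n$ grows, in line with the interpretation of finer and finer subdivisions of the unit time interval.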

Example 1.5.14. (Island problem) This problem was responsible for an interesting 
debate in the probability literature. Consider an island with n + 2 inhabitants. One 
of them is killed, and the murderer must be one of the inhabitants of the island. 
Police investigators discover a DNA profile at the scene of the crime. Scientists are 
able to say that this particular DNA profile occurs in a fraction p of all people. 
Now the police start a big screening of all the inhabitants of the island. The first
person to be screened, let’s call him John Smith, turns out to have this particular 
DNA profile. What is the probability that John Smith is the murderer? 



In order to say something about this, we need to make certain assumptions.
Let us assume that each person has the DNA profile with probability $p$, independently
of the other people. Here are two ‘solutions’ to the problem:

(1) Apart from John Smith, there are $n$ other people on the island who are the
potential murderer. Suppose first that John Smith is the only one with the profile.
In that case he is the murderer for sure. If there is exactly one other person with
the profile, then the probability that John is the criminal is 1/2. In general, if
there are $k$ other people with the profile on the island, then the probability that
John Smith is our man is equal to $1/(k + 1)$. Writing $A_k$ for the event that exactly
$k$ other people have the same profile as John Smith, we have

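$$P(A_k) = \binom{n}{k} p^k (1 - p)^{n-k}. \qquad (1.2)$$

We then compute, writing $G$ for the event that John Smith is the killer,

$$\begin{aligned}
P(G) &= \sum_{k=0}^{n} P(G \mid A_k) P(A_k) \\
&= \sum_{k=0}^{n} \frac{1}{k+1} \binom{n}{k} p^k (1 - p)^{n-k} \\
&= \frac{1}{p(n+1)} \sum_{k=0}^{n} \frac{(n+1)!}{(k+1)!(n-k)!} p^{k+1} (1 - p)^{(n+1)-(k+1)} \\
&= \frac{1}{p(n+1)} \sum_{k=1}^{n+1} \binom{n+1}{k} p^k (1 - p)^{n+1-k} \\
&= \frac{1}{p(n+1)} \left(1 - (1 - p)^{n+1}\right).
\end{aligned}$$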



(2) We can, alternatively, apply Bayes’ rule, Theorem 1.4.12. Denote the event that
John Smith is the murderer by $G$, and denote the event of finding John Smith’s
particular DNA profile at the scene of the crime by $E$. Before seeing this DNA
profile, any inhabitant of the island is equally likely to be the murderer, and we
therefore have $P(G) = 1/(n + 1)$ and $P(G^c) = n/(n + 1)$. We want to compute the
probability of $G$ after the DNA evidence $E$, that is, we want to compute $P(G \mid E)$.

If John Smith is the murderer, the event $E$ occurs with probability 1, that
is, $P(E \mid G) = 1$. If John Smith is not the murderer, then the real murderer has
with probability $p$ the same profile as John Smith, and therefore, $P(E \mid G^c) = p$.
We can now, with this information, compute the probability that John Smith is
the murderer, given the event that his DNA profile was found at the scene of the
crime, that is, we can compute $P(G \mid E)$:

$$P(G \mid E) = \frac{P(E \mid G) P(G)}{P(E \mid G) P(G) + P(E \mid G^c) P(G^c)} = \frac{1/(n + 1)}{1/(n + 1) + (pn)/(n + 1)} = \frac{1}{1 + pn}.$$

This is alarming, since the two methods give different answers. Which one is cor- 
rect? We will now explain why method (1) is wrong. In method (1), we said that 
the probability that there are k other people with John Smith’s profile is given
by (1.2). This seems obvious, but is, in fact, not correct. The fact that the first 
person to be checked has the particular DNA profile, says something about the 
total number of individuals with this profile. The situation is very similar to the 
situation in Example 1.4.4. In that example, even when we know that a family 
has at least one boy, when we then actually see a boy opening the door, this new 
information does change the conditional probability that the family has two boys. 
The bare fact that a boy opened the door, makes it more likely that there are two 
boys. Similarly, the fact that the first person to be screened has the DNA profile, 
makes it more likely that there are more such persons. □ 
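
The disagreement between the two methods can also be checked numerically. The following Python sketch is only an illustration and not part of the original argument. It assumes that John Smith is known to have the profile, that the murderer is chosen uniformly among the n + 1 islanders, and that any other islander has the profile with probability p. The function names `method_one`, `bayes` and `simulate` are invented for this sketch.

```python
import random


def method_one(n, p):
    """Answer of method (1): (1 - (1 - p)^(n + 1)) / (p (n + 1))."""
    return (1 - (1 - p) ** (n + 1)) / (p * (n + 1))


def bayes(n, p):
    """Answer of method (2), Bayes' rule: 1 / (1 + p n)."""
    return 1 / (1 + p * n)


def simulate(n, p, trials=200_000, seed=0):
    """Estimate P(G | E) by simulation (modelling assumptions as in the lead-in)."""
    rng = random.Random(seed)
    evidence = 0   # runs in which the murderer has John's profile (event E)
    john = 0       # runs in which, in addition, John is the murderer (event G)
    for _ in range(trials):
        murderer = rng.randrange(n + 1)   # person 0 plays the role of John Smith
        if murderer == 0:
            evidence += 1                 # John certainly matches his own profile
            john += 1
        elif rng.random() < p:
            evidence += 1                 # another murderer matches with probability p
    return john / evidence
```

With n = 10 and p = 0.1, Bayes' rule gives 1/2 while method (1) gives roughly 0.62; under the stated assumptions, the simulated frequency should land near the former.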

Exercise 1.5.15. Method (1) above can be made correct by taking into account
the so-called size bias which we tried to explain above. Can you compute the right 
answer via this method? This is not so easy. 






1.6 A First Law of Large Numbers 

One of the most basic intuitive ideas in probability theory is the idea that when we 
flip a fair coin very often, the fraction of heads should be roughly 1/2. A so-called 
law of large numbers is a mathematical formulation of this idea. We will encounter 
many laws of large numbers in this book. 

It is perhaps surprising that with a minimum of technical machinery, we can 
already state and prove the first of these laws of large numbers. We return to the 
setting of Example 1.2.8. Consider the event Bk that we see exactly k tails in n 
coin flips. For our present purpose, the number n is not fixed and we shall take
a limit for $n \to \infty$ in a moment. This means that we should be careful with our
notation, and we had better express the dependence on n of the various quantities.
Hence we denote the probability measure by $P_n$, and the event of seeing exactly
k tails by $B_{k,n}$. In this notation, recall that

$$P_n(B_{k,n}) = \binom{n}{k} 2^{-n}.$$

The idea of the law of large numbers is that when n is large, the fraction of tails
in the outcome should be close to 1/2. One way of expressing this is to say that
the probability that this fraction is close to 1/2 should be large. Therefore, we
consider the event that after n coin flips, the fraction of tails is between $1/2 - \epsilon$
and $1/2 + \epsilon$. We can express this event in terms of the $B_{k,n}$s by

$$\bigcup_{n(\frac{1}{2} - \epsilon) < k < n(\frac{1}{2} + \epsilon)} B_{k,n}.$$



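Theorem 1.6.1. (Law of large numbers) For any $\epsilon > 0$, we have

$$P_n\left( \bigcup_{n(\frac{1}{2} - \epsilon) < k < n(\frac{1}{2} + \epsilon)} B_{k,n} \right) \to 1$$

as $n \to \infty$.

Proof. We shall prove that

$$P_n\left( \bigcup_{n(\frac{1}{2} + \epsilon) \le k \le n} B_{k,n} \right) \to 0 \tag{1.3}$$

and

$$P_n\left( \bigcup_{0 \le k \le n(\frac{1}{2} - \epsilon)} B_{k,n} \right) \to 0 \tag{1.4}$$

as $n \to \infty$. This is enough, since $P_n\left(\bigcup_{k=0}^{n} B_{k,n}\right)$ is equal to 1, and hence the
probability of the union over all the remaining indices must converge to 1. We
shall only prove (1.3); the proof of (1.4) is similar and left to you.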

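First observe that

$$P_n\left( \bigcup_{k \ge n(\frac{1}{2} + \epsilon)} B_{k,n} \right) = \sum_{k \ge n(\frac{1}{2} + \epsilon)} P_n(B_{k,n}) = \sum_{k \ge n(\frac{1}{2} + \epsilon)} \binom{n}{k} 2^{-n}.$$
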

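The following surprising trick is quite standard in probability theory. For $\lambda > 0$,
we find that

$$
\begin{aligned}
\sum_{k \ge n(\frac{1}{2} + \epsilon)} \binom{n}{k} 2^{-n}
&\le \sum_{k \ge n(\frac{1}{2} + \epsilon)} e^{\lambda (k - n(\frac{1}{2} + \epsilon))} \binom{n}{k} 2^{-n} \\
&= e^{-\lambda n \epsilon} \sum_{k \ge n(\frac{1}{2} + \epsilon)} \binom{n}{k} \left(\tfrac{1}{2} e^{\lambda/2}\right)^{k} \left(\tfrac{1}{2} e^{-\lambda/2}\right)^{n-k} \\
&\le e^{-\lambda n \epsilon} \sum_{k=0}^{n} \binom{n}{k} \left(\tfrac{1}{2} e^{\lambda/2}\right)^{k} \left(\tfrac{1}{2} e^{-\lambda/2}\right)^{n-k} \\
&= e^{-\lambda n \epsilon} \left( \tfrac{1}{2} e^{\lambda/2} + \tfrac{1}{2} e^{-\lambda/2} \right)^{n},
\end{aligned}
$$

according to Exercise 1.5.7. It is not hard to show that for all $x \in \mathbb{R}$, $e^x \le x + e^{x^2}$;
see the forthcoming Exercise 1.6.2. Using this inequality, we find that the last
expression is at most

$$e^{-\lambda n \epsilon} e^{\lambda^2 n/4} = e^{\lambda^2 n/4 - \lambda n \epsilon}.$$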

Now we can find $\lambda$ to minimise the right-hand side, that is, $\lambda = 2\epsilon$. This then
finally yields the bound

$$P_n\left( \bigcup_{k \ge n(\frac{1}{2} + \epsilon)} B_{k,n} \right) \le e^{-\epsilon^2 n}, \tag{1.5}$$

which tends to zero when n tends to infinity. □ 
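
The bound (1.5) can be compared with the exact tail probability for concrete values of n. The Python sketch below is an illustration only; the function names `upper_tail` and `exponential_bound` are invented here, and the exact tail is computed directly from the binomial coefficients.

```python
from math import ceil, comb, exp


def upper_tail(n, eps):
    """Exact probability that at least n(1/2 + eps) of n fair coin flips are tails."""
    k_min = ceil(n * (0.5 + eps))
    # Add the number of outcomes with k tails, k >= k_min, and divide by 2^n.
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n


def exponential_bound(n, eps):
    """Right-hand side of (1.5): exp(-eps^2 n)."""
    return exp(-eps ** 2 * n)


for n in (10, 100, 1000):
    print(n, upper_tail(n, 0.1), exponential_bound(n, 0.1))
```

The bound is far from sharp for small n, but both columns decrease to zero as n grows, which is all the proof needs.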

Exercise 1.6.2. Prove that for all $x \in \mathbb{R}$, we have $e^x \le x + e^{x^2}$.



1.7 Exercises 

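Exercise 1.7.1. In this exercise, A and B are events. Show that

(a) $(A \cap B^c) \cup B = A \cup B$;

(b) $(A \cup B) \cap B^c = A \setminus B$;

(c) $(A \cup B) \cap (A \cap B)^c = (A \setminus B) \cup (B \setminus A)$;

(d) $A = (A \cap B) \cup (A \cap B^c)$;

(e) $(A \cap B) \cup (A \cap C) = A \cap (B \cup C)$;

(f) $(A \cup B) \cap (A \cup C) = A \cup (B \cap C)$.

Exercise 1.7.2. In this exercise, A, B and C are events. Express the following events
in terms of A, B and C: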

(a) Only A occurs; 

(b) A and B occur, but C does not occur; 

(c) All three events occur; 

(d) Exactly one of the three events occurs; 

(e) At most one of the events occurs; 

(f) None of the three events occurs; 

(g) At least two of the three events occur. 

Exercise 1.7.3. Suppose we throw a die twice. What is the probability that 

(a) the two outcomes are the same? 

(b) the two outcomes are different and sum up to 8? 

(c) the sum of the outcomes is 10? 

Exercise 1.7.4. What is the probability that all four queens end up at one player 
in a play of bridge?

Exercise 1.7.5. An urn contains 7 red and 5 blue balls, which we take out (without
looking!), one by one. 

(a) What is the probability that the first ball is blue? 

(b) What is the probability that the last ball is blue? 

(c) What is the probability that the last ball is blue, given that the first ball is 
blue? 

Exercise 1.7.6. Consider four boxes, numbered 1 to 4. We throw four balls in the 
boxes in such a way that each ball ends up in any particular box with probability 
1/4. 

(a) What is the probability that there will be at least one empty box? 

(b) What is the probability that there is exactly one empty box? 

(c) What is the probability that box 1 remains empty? 

Exercise 1.7.7. Suppose there are three men and four women who have to be 
arranged in a circle. If we do this randomly, what is the probability that all men 
stand together in one group? 

Exercise 1.7.8. Consider a group of four people. Everybody writes down the name 
of one other (random) member of the group. What is the probability that there is 
at least one pair of people who wrote down each other’s name?

Exercise 1.7.9. Suppose that we play bridge. Each player receives 13 cards. What 
is the probability that south receives exactly 8 spades, and north the remaining 
5? 

Exercise 1.7.10. Suppose we dial a random number on my telephone; the number
is six digits long. What is the probability that

(a) the number does not contain a 6; 

(b) the number contains only even digits; 

(c) the number contains the pattern 2345; 

(d) the number contains the pattern 2222. 

Exercise 1.7.11. Suppose that we order the numbers 1,2, ...,n completely ran- 
domly. What is the probability that 1 is immediately followed by 2? 

Exercise 1.7.12. We choose an integer N at random from {1,2,..., 10^}. What is 
the probability that N is divisible by 3? by 5? by 105? How would your answer 
change if 10^ is replaced by $10^k$, as k gets larger and larger?

Exercise 1.7.13. Consider the complete graph $K_4$ with four vertices; all vertices are
connected by an edge to all other vertices. Suppose now that we flip an unbiased 
coin for each edge. If heads comes up, we leave the edge where it is, if tails comes 
up we remove the edge. 

(a) What is the probability that two given vertices are still connected after the 
removal of the edges? 

(b) What is the probability that the graph remains connected? 

(c) What is the probability that a given vertex becomes isolated?

Exercise 1.7.14. Suppose that we have a tennis tournament with 32 players. Players
are matched in a completely random fashion, and we assume that each player
always has probability 1/2 to win a match. What is the probability that two given
players meet each other during the tournament?

Exercise 1.7.15. Peter and Paul have a disagreement, and they want to make a 
decision by throwing a coin. Paul suspects that the coin is biased. Design a rule 
so that they can come to a fair decision. 

Exercise 1.7.16. Consider n pairs of shoes. Suppose that we take 2r of these (without
looking, of course), where 2r < n. What is the probability that there is no pair
among these 2r shoes? Can you also compute the probability that among these 2r 
shoes, there is exactly one pair? 

Exercise 1.7.17. Consider two urns, one containing 5 red and 10 white balls, and 
the other with 5 white and 10 red balls. Now choose one of the urns randomly, and 
take two random balls from the chosen urn. Let A be the event that the first ball is 
red, and B be the event that the second ball is white. Are A and B independent? 

Exercise 1.7.18. Suppose that we send a message using some coding so that only 
0’s and 1’s are sent. On the average, the ratio between the number of 0’s and
1’s that are sent is 3/4. As a result of problems with the connection, each 0 sent
is received as a 1 with probability and each 1 sent is received as a 0 with 
probability |. Compute the probability that a symbol received as a 1, was also 
sent as a 1. 

Exercise 1.7.19. Three people, A, B and C, play a game in which they throw coins,
one after the other. A starts, then B, then C, then A again, etcetera. The person 
who throws heads first, wins the game. Construct an appropriate sample space for 
this game, and find the probability that A wins the game. 

Exercise 1.7.20. Suppose that 20 rabbits live in a certain region. We catch 5 
of them, mark these, and let them go again. After a while we catch 4 rabbits. 
Compute the probability that exactly 2 of these 4 are marked. Be very precise 
about the assumptions that you make when you compute this probability. 

Exercise 1.7.21. Suppose you want to ask a large number of people a question to 
which the answer ‘yes’ is embarrassing. As the question is asked, a coin is tossed, 
out of sight of the questioner. If the answer would have been ‘no’ and the coin 
shows a head, then the answer ‘yes’ is given. Otherwise people respond truthfully. 
Do you think that this is a good procedure? 

Exercise 1.7.22. A fair coin is tossed n times. Let Bn and Tn denote the number 
of heads and tails among these n tosses. Show that, for any e > 0, we have 




as $n \to \infty$.

Exercise 1.7.23. (Simpson’s paradox) A researcher wants to determine the relative
efficacies of two drugs. The results (differentiated between men and women) were
as follows.

    women      drug I    drug II
    success       200         10
    failure      1800        190

    men        drug I    drug II
    success        19       1000
    failure         1       1000


We are now faced with the question which drug is better. Here are two possible 
answers: 

(1) Drug I was given to 2020 people, of whom 219 were cured. Drug II was given 
to 2200 people, of whom 1010 were cured. Therefore, drug II is much better.

(2) Amongst women, the success rate of drug I is 1/10, and for drug II the success 
rate is 1/20. Amongst men, these rates are 19/20 and 1/2 respectively. In both 
cases, that is, for both men and women, drug I wins, and is therefore better. 

Which of the two answers do you believe? Can you explain the paradox? 

Exercise 1.7.24. Suppose that we want to distribute five numbered balls over 
three boxes I, II and III. Each ball is put in a random box, independently of the 
other balls. Describe an appropriate sample space and probability measure for this 
experiment. Compute the probability that 

(a) box I remains empty; 

(b) at most one box remains empty; 

(c) box I and II remain empty. 

Exercise 1.7.25. An urn contains 10 white, 5 yellow and 10 black balls. We pick a 
random ball. What is the probability that the ball is yellow, given that it is not 
black? 

Exercise 1.7.26. It is known that 5% of the men are colour blind, and of 
the women are colour blind. Suppose that there are as many men as women. We 
choose a person, who turns out to be colour blind. What is the probability that 
this person is a man? 

Exercise 1.7.27. We choose a random card from a deck of 52 cards. Let A be the 
event that the card is a queen, and B be the event that it is a spade. Show that 
A and B are independent. 

Exercise 1.7.28. Suppose that we have a very special die, namely with exactly k
faces, where k is a prime. The faces of the die are numbered 1, . . . , k. We throw
the die and see which number comes up.

(a) What would be an appropriate sample space and probability measure?

(b) Suppose that the events A and B are independent. Show that A and B can 
only be the full sample space, or the empty set. 

Exercise 1.7.29. Consider a family that we know has two children. I tell you that 
it is not the case that they have two girls. What is the (conditional) probability 
that they have two boys? 

Now suppose that I suddenly see the father of this family walking with a 
little boy holding his hand. Does this change the conditional probability that the 
family has two boys? 

Exercise 1.7.30. Suppose we throw a fair coin n times. Let A be the event that we 
see at most one head, and let B be the event that we see at least one head and 
at least one tail. Show that A and B are independent when n = 3 and show that 
they are not independent when $n \neq 3$.

Exercise 1.7.31. Suppose that I have two coins in my pocket. One ordinary, fair 
coin, and one strange coin with heads on either side. I pick a random coin out of 
my pocket, throw it, and it turns out that heads comes up. 

(a) What is the probability that I have thrown the fair coin? 

(b) If I throw the same coin again, and heads comes up again, what is the proba- 
bility that I have thrown the fair coin? 

Exercise 1.7.32. A pack contains m cards, labelled 1, 2, . . . , m. The cards are dealt 
out in a random order, one by one. Given that the label of the kth card dealt is 
the largest of the first k cards, what is the probability that it is also the largest in 
the whole pack? 

Exercise 1.7.33. (de Méré’s paradox) Which of the following two events has the
highest probability: 

(1) at least one 6, when we throw a die 4 times; 

(2) at least one double 6, when we throw two dice 24 times. 

Exercise 1.7.34. Let $A_1, A_2, \ldots$ be events. Show that

$$P\left( \bigcap_{i=1}^{n} A_i \right) \ge \sum_{i=1}^{n} P(A_i) - (n-1),$$

for all $n = 1, 2, \ldots$.

Exercise 1.7.35. Consider the following game: player I flips a fair coin n + 1 times; 
player II flips a fair coin n times. Show that the probability that player I has more
heads than player II is equal to 1/2. Is this counterintuitive, given the fact that
player I flips the coin one extra time?

Exercise 1.7.36. Consider two events A and B, both with positive probability. 
When is it the case that $P(A \mid B) = P(B \mid A)$?

Random Variables and Random Vectors



It often happens that we do not really care about the outcome of an experiment 
itself, but rather we are interested in some consequence of this outcome. For in- 
stance, a gambler is not primarily interested in the question whether or not heads 
comes up, but instead in the financial consequences of such an outcome. Hence 
the gambler is interested in a function of the outcome, rather than in the outcome 
itself. Such a function is called a random variable and in this chapter we define 
and study such random variables. 



2.1 Random Variables 

Suppose that we flip a coin n times. Each time tails comes up we lose one euro, 
and when heads comes up we win one euro. The sample space corresponding to 
this experiment could be $\{-1, 1\}^n$, the set of sequences of $-1$s and $1$s of length n.
How can we express our fortune after n flips?

Writing $\omega = (\omega_1, \ldots, \omega_n)$ as usual, where $\omega_i = 1$ if the ith flip is a head, my
winnings after n flips are equal to

$$S(\omega) = \sum_{i=1}^{n} \omega_i.$$

Thus our winnings constitute a mapping $S : \Omega \to \mathbb{R}$. Such mappings are called
random variables.

Definition 2.1.1. A random variable X is a mapping from a sample space $\Omega$ into $\mathbb{R}$.

Figure 2.1: A sketch of the distribution function of X.



Typically, we denote random variables with letters near the end of the alphabet,
like X, Y and Z. We are often interested in the probability that a random
variable takes certain values. Hence we are interested in

$$P(\{\omega : X(\omega) = x\}),$$

for all appropriate x. We shall often write $\{X = x\}$ for $\{\omega : X(\omega) = x\}$ and
$P(X = x)$ for $P(\{\omega : X(\omega) = x\})$.

Definition 2.1.2. The probability mass function of a random variable X is the
function $p_X : \mathbb{R} \to [0, 1]$ given by

$$p_X(x) = P(X = x).$$

Definition 2.1.3. The distribution function of a random variable X is the function
$F_X : \mathbb{R} \to [0, 1]$ given by

$$F_X(x) = P(X \le x).$$

Example 2.1.4. Suppose that the random variable X takes the value 1 with probability
1, that is, $P(X = 1) = 1$. The distribution function is then given by
$F_X(x) = 1$ for $x \ge 1$, and $F_X(x) = 0$ for $x < 1$. Note that this function is
continuous from the right but not continuous from the left at x = 1. □

Example 2.1.5. Consider a random variable X with $P(X = 1) = 1/2$, $P(X = 2) = 1/4$
and $P(X = 3) = 1/4$. The distribution function of X is given by

$$
F_X(x) =
\begin{cases}
0 & \text{if } x < 1, \\
1/2 & \text{if } 1 \le x < 2, \\
3/4 & \text{if } 2 \le x < 3, \\
1 & \text{if } x \ge 3,
\end{cases}
$$

see Figure 2.1.



□ 
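
The step function of Example 2.1.5 can be generated mechanically from the probability mass function. The following Python sketch is an illustration under the assumption that X takes only finitely many values; the dictionary `pmf` and the function name `distribution_function` are invented here.

```python
from fractions import Fraction

# Probability mass function of Example 2.1.5: value -> probability.
pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}


def distribution_function(pmf, x):
    """F_X(x) = P(X <= x): add up the mass of all values not exceeding x."""
    return sum((prob for value, prob in pmf.items() if value <= x), Fraction(0))


for x in (0.5, 1, 1.5, 2, 2.9, 3, 10):
    print(x, distribution_function(pmf, x))
```

Because the comparison is `value <= x`, the jump at each value is included at the point itself, which is exactly the continuity from the right noted in Example 2.1.4.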



Example 2.1.6. Suppose that we toss a fair coin n times. The number of heads is
a random variable which we denote by X. The probability mass function of X is
given by

$$p_X(k) = \binom{n}{k} 2^{-n},$$

for $k = 0, \ldots, n$, and $p_X(k) = 0$ for all other values of k. Hence its distribution
function is given by

$$F_X(x) = 2^{-n} \sum_{0 \le k \le x} \binom{n}{k},$$

for $0 \le x \le n$; $F_X(x) = 0$ for $x < 0$; $F_X(x) = 1$ for $x > n$.

□



Example 2.1.7. (Binomial distribution) A random variable X is said to have a
binomial distribution with parameters $n \in \mathbb{N}$ and $p \in [0, 1]$ if

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k},$$

for k = 0, 1, . . . , n. We have seen examples of such random variables in Chapter
1 when we discussed coin flips. The random variable X represents the number of
heads when we flip a coin n times, where each flip gives heads with probability p.
In particular, we can write such a random variable X as a sum

$$X = \sum_{i=1}^{n} Y_i,$$

where $Y_i = 1$ if the ith flip yields a head, and $Y_i = 0$ otherwise. □

Example 2.1.8. (Poisson distribution) A random variable X is said to have a
Poisson distribution with parameter $\lambda > 0$ if

$$P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda},$$

for k = 0, 1, . . .. We have come across this distribution already in Example 1.5.11,
where it naturally appeared as a limit of binomial random variables. □ 
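
The limit statement can be made tangible by comparing the two probability mass functions numerically. The Python sketch below is an illustration only. It assumes that the binomial parameters are n and $\lambda/n$, the usual scaling for this limit, and the names `binomial_pmf`, `poisson_pmf` and `max_gap` are invented here.

```python
from math import comb, exp, factorial


def binomial_pmf(n, p, k):
    """P(X = k) for a binomial distribution with parameters n and p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(lam, k):
    """P(X = k) for a Poisson distribution with parameter lam."""
    return lam ** k / factorial(k) * exp(-lam)


def max_gap(n, lam, k_max=10):
    """Largest difference between the two mass functions over k = 0, ..., k_max."""
    return max(abs(binomial_pmf(n, lam / n, k) - poisson_pmf(lam, k))
               for k in range(k_max + 1))


for n in (10, 100, 1000):
    print(n, max_gap(n, 2.0))
```

The printed gaps shrink as n grows, in line with the Poisson distribution arising as a limit of binomial distributions.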

Example 2.1.9. (Geometric distribution) A random variable X has a geometric
distribution with parameter $p \in [0, 1]$ if

$$P(X = k) = p(1-p)^{k-1},$$

for k = 1, 2, . . .. We have seen a random variable with this distribution in Example
1.5.8. We can view X as representing the waiting time until the first time heads
comes up in a sequence of independent coin flips.

There is one subtle thing that needs attention at this point. We interpreted 
a geometric random variable as the waiting time for the first head to come up in a 
sequence of coin flips. This suggests that we want to define X on a sample space 
which corresponds to infinitely many coin flips. Indeed, the first head may come 
up at any time: there is no bound on the time at which the first head comes up. 
However, all our sample spaces so far contained at most countably many points, 
and we shall see later that any sample space which represents infinitely many
coin flips is necessarily uncountable. Is this a problem? In some sense yes, but in
another sense no. If we flip a coin, and we are only interested in the time that the
first head comes up, then it is enough to take $\{1, 2, \ldots\} \cup \{\infty\}$ as our sample space.
An outcome k then corresponds to the event that the first head appears at the kth
flip, and the outcome $\infty$ corresponds to the event that we never see a head. This
sample space clearly does not capture all events in the coin flips. For instance,
an outcome X = 6 says nothing about the 7th, 8th, . . . coin flip. But if we only
want to know something about the time of the first head, then this is no problem.
Hence, the choice of the sample space depends on the question or object you want
to study. When it comes down to defining an appropriate probability measure for
this experiment, we have seen that it is reasonable to take $P(k) = (1-p)^{k-1} p$
for k = 1, 2, . . .. What should $P(\infty)$ be? Well, since $\sum_{k=1}^{\infty} (1-p)^{k-1} p = 1$, there is no
probability mass left for $\infty$, and we conclude, a fortiori, that $P(\infty) = 0$. That
is, in this model it is certain that we will throw a head at some point. This is of 
course in complete agreement with our probabilistic intuition. □ 
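
Both claims, that the masses $(1-p)^{k-1}p$ add up to 1 and that they describe the waiting time for the first head, can be checked numerically. The Python sketch below is an illustration only; it assumes 0 < p < 1 and a pseudo-random coin, and the names `geometric_pmf`, `partial_sum` and `first_head` are invented here.

```python
import random


def geometric_pmf(p, k):
    """P(k) = (1 - p)^(k - 1) p: first head at flip k."""
    return (1 - p) ** (k - 1) * p


def partial_sum(p, m):
    """Total mass of the outcomes 1, ..., m; it tends to 1 as m grows."""
    return sum(geometric_pmf(p, k) for k in range(1, m + 1))


def first_head(p, rng):
    """Flip a coin with head probability p until the first head; return the number of flips."""
    k = 1
    while rng.random() >= p:   # a value >= p counts as tails
        k += 1
    return k


rng = random.Random(1)
samples = [first_head(0.3, rng) for _ in range(100_000)]
for k in (1, 2, 3):
    print(k, samples.count(k) / len(samples), geometric_pmf(0.3, k))
print(partial_sum(0.3, 200))
```

The mass missing from `partial_sum(p, m)` is $(1-p)^m$, the probability of seeing no head in the first m flips, which is why nothing is left over for the outcome $\infty$.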

Example 2.1.10. (Negative binomial distribution) The random variable X has a
negative binomial distribution with parameters $r \in \mathbb{N}$ and $p \in [0, 1]$ if

$$P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r},$$

for k = r, r + 1, . . .. This looks rather complicated. However, we can interpret this
as the waiting time until heads comes up for the rth time in a sequence of coin flips.
To see this, note that in order to see heads for the rth time at the kth flip, we must
have seen exactly $r - 1$ heads among the first $k - 1$ flips. This can be realised in
$\binom{k-1}{r-1}$ ways. The probability that this happens for a particular realisation with $r - 1$
heads among the first $k - 1$ flips is simply $p^{r-1}(1-p)^{k-r}$.
The requirement that the kth flip is also a head gives an extra factor p, and then
we arrive at the given formula.

Observe that we can split the waiting time until the rth head into r pieces, 
each piece consisting of the waiting time until the next heads appears. This makes 
it clear that a random variable X with a negative binomial distribution with 
parameters r and p, can be seen as the sum of r geometrically distributed random 
variables. □ 
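
The waiting-time interpretation can be tested by simulation. The Python sketch below is an illustration only: it flips a pseudo-random coin until the rth head and compares the observed frequencies with the formula of Example 2.1.10. The names `negative_binomial_pmf` and `waiting_time_for_rth_head` are invented here.

```python
import random
from math import comb


def negative_binomial_pmf(r, p, k):
    """P(X = k): the rth head appears at flip k."""
    return comb(k - 1, r - 1) * p ** r * (1 - p) ** (k - r)


def waiting_time_for_rth_head(r, p, rng):
    """Flip a coin with head probability p until r heads have appeared; return the number of flips."""
    flips = heads = 0
    while heads < r:
        flips += 1
        if rng.random() < p:
            heads += 1
    return flips


rng = random.Random(2)
samples = [waiting_time_for_rth_head(3, 0.4, rng) for _ in range(100_000)]
for k in (3, 4, 5, 6):
    print(k, samples.count(k) / len(samples), negative_binomial_pmf(3, 0.4, k))
```

Each run of the loop between two consecutive heads is itself a geometric waiting time, so the simulation also mirrors the decomposition into r geometric pieces.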

Exercise 2.1.11. Show that the probabilities of the negative binomial distribution
sum up to 1.

It is useful to list a number of properties shared by all distribution functions. 

Theorem 2.1.12. A distribution function F of a random variable X has the fol- 
lowing properties: 

(a) $\lim_{x \to \infty} F(x) = 1$;

(b) $\lim_{x \to -\infty} F(x) = 0$;

(c) F is non-decreasing;

(d) F is continuous from the right;

(e) $P(X > x) = 1 - F(x)$;

(f) $P(x < X \le y) = F(y) - F(x)$;

(g) $P(X = x) = F(x) - \lim_{y \uparrow x} F(y)$.

In the proof of this theorem, we need the following lemma, which is of inde- 
pendent interest. 

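Lemma 2.1.13. (a) Let $A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots$, and let $A = \bigcup_{i=1}^{\infty} A_i$. Then

$$P(A) = \lim_{i \to \infty} P(A_i).$$

(b) Let $B_1 \supseteq B_2 \supseteq B_3 \supseteq \cdots$, and let $B = \bigcap_{i=1}^{\infty} B_i$. Then

$$P(B) = \lim_{i \to \infty} P(B_i).$$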

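Proof. (a) We can write $A = A_1 \cup (A_2 \setminus A_1) \cup (A_3 \setminus A_2) \cup \cdots$ as a union of pairwise
disjoint events. Hence,

$$
\begin{aligned}
P(A) &= P(A_1) + \sum_{i=1}^{\infty} P(A_{i+1} \setminus A_i) \\
&= P(A_1) + \lim_{n \to \infty} \sum_{i=1}^{n-1} \left( P(A_{i+1}) - P(A_i) \right) \\
&= \lim_{n \to \infty} \left\{ P(A_1) + P(A_2) - P(A_1) + P(A_3) - P(A_2) + \cdots + P(A_n) - P(A_{n-1}) \right\} \\
&= \lim_{n \to \infty} P(A_n).
\end{aligned}
$$

(b) See Exercise 2.1.14. □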

Exercise 2.1.14. Prove Lemma 2.1.13(b). 



Proof of Theorem 2.1.12. For (a), let $A_i$ be the event that $X \le i$. Then, as $i \to \infty$,
$A_i \uparrow \Omega$, and the statement follows from Lemma 2.1.13(a). (b) is proved similarly
and left as an exercise. To prove (c), note that when $x < y$, $\{X \le x\} \subseteq \{X \le y\}$.
The statement now follows from Lemma 1.3.1(c).

To prove (d), note that

$$\{X \le x\} = \bigcap_{n=1}^{\infty} \left\{ X \le x + \frac{1}{n} \right\}.$$

It then follows from Lemma 2.1.13(b) that

$$F(x) = P(X \le x) = \lim_{n \to \infty} P\left( X \le x + \frac{1}{n} \right) = \lim_{n \to \infty} F\left( x + \frac{1}{n} \right).$$

Since F is a monotone function, $\lim_{h \downarrow 0} F(x + h)$ exists and by the previous
computation, this limit must be $F(x)$.

The statement in (e) follows from Lemma 1.3.1(b). The proof of (f) is left as 
an exercise. 

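For (g), observe that

$$\{X = x\} = \bigcap_{n=1}^{\infty} \left\{ x - 1/n < X \le x \right\}.$$

Hence, we have according to Lemma 2.1.13(b) that

$$
\begin{aligned}
P(X = x) &= \lim_{n \to \infty} P(x - 1/n < X \le x) \\
&= \lim_{n \to \infty} \left( F(x) - F(x - 1/n) \right) \\
&= F(x) - \lim_{n \to \infty} F(x - 1/n).
\end{aligned}
$$

□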



Exercise 2.1.15. Complete the proof of Theorem 2.1.12. 

Exercise 2.1.16. Suppose that X has distribution function F. What is the distribution
function of the random variable Y defined by $Y = aX + b$?

We end this section with a statement to the effect that the probability mass 
function of a random variable is uniquely determined by its distribution function, 
and vice versa. 

Theorem 2.1.17. Two random variables have the same probability mass function 
if and only if they have the same distribution function. 

Proof. Let X and Y be such that $p_X(x) = p_Y(x)$ for all x. Then

$$F_X(x) = P(X \le x) = \sum_{y : y \le x} p_X(y) = \sum_{y : y \le x} p_Y(y) = F_Y(x).$$

Conversely, suppose that $F_X(x) = F_Y(x)$ for all x. Then it follows immediately
from Theorem 2.1.12(g) that $p_X(x) = p_Y(x)$ for all x. □

Definition 2.1.18. If two random variables X and Y have the same probability 
mass function or, equivalently, the same distribution function, then we say that X 
and Y have the same distribution. Asking for the distribution of a random variable 
is asking for either its probability mass function, or its distribution function.

2.2 Independence



I have heard it said that probability theory is just analysis plus the concept of in- 
dependence. I am not sure whether this makes any sense, but it certainly expresses 
the idea that independence plays a crucial role in probability theory. 

We want two random variables to be called independent, if knowledge of the 
outcome of the first has no effect on the distribution of the second. Here is the 
formal definition. 

Definition 2.2.1. The random variables $X_1, X_2, \ldots, X_n$ are called independent if
the events $\{X_1 = x_1\}, \ldots, \{X_n = x_n\}$ are independent for any choice of $x_1, \ldots, x_n$.

The concept is best illustrated with some examples. 

Example 2.2.2. (General coin tossing) Consider the general coin tossing in Example
1.5.8. Let $X(\omega) = \omega_1$ and $Y(\omega) = \omega_2$. That is, X is the outcome of the first
coin flip, and Y is the outcome of the second. We already showed in Chapter 1 
that X and Y are independent. □ 

Example 2.2.3. Suppose we flip a coin a random number N of times, where

$$P(N = n) = e^{-\lambda} \frac{\lambda^n}{n!}, \quad n = 0, 1, 2, \ldots,$$

that is, N has a Poisson distribution. The flips are independent, and the probability
of seeing heads is equal to p. Hence if we know that N = n, the number of heads is
simply the number of heads in n coin flips with success probability p. We write X
for the total number of heads, and Y for the total number of tails, so X + Y = N.
The above conclusion can now be formalised as follows:

$$P(X = x \mid N = n) = \binom{n}{x} p^x (1-p)^{n-x}.$$


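Are X and Y independent? It is perhaps surprising that they are. To see this, we
compute

$$
\begin{aligned}
P(X = x, Y = y) &= P(X = x, Y = y \mid N = x + y) P(N = x + y) \\
&= \binom{x+y}{x} p^x (1-p)^y e^{-\lambda} \frac{\lambda^{x+y}}{(x+y)!} \\
&= e^{-\lambda} \frac{(\lambda p)^x (\lambda(1-p))^y}{x!\, y!}.
\end{aligned}
$$

On the other hand, we have

$$
\begin{aligned}
P(X = x) &= \sum_{n \ge x} P(X = x \mid N = n) P(N = n) \\
&= \sum_{n \ge x} \binom{n}{x} p^x (1-p)^{n-x} e^{-\lambda} \frac{\lambda^n}{n!} \\
&= \frac{(\lambda p)^x}{x!} e^{-\lambda p},
\end{aligned}
$$

and a similar computation gives

$$P(Y = y) = \frac{(\lambda(1-p))^y}{y!} e^{-\lambda(1-p)}.$$

Hence, $P(X = x, Y = y) = P(X = x)P(Y = y)$, and we conclude that X and Y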
are independent. □ 
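
The factorisation can be confirmed numerically for specific parameter values. The Python sketch below is an illustration only. It truncates the infinite sums over n at a hypothetical cut-off `n_max`, and the function names are invented here.

```python
from math import comb, exp, factorial


def poisson_pmf(lam, n):
    """P(N = n) for a Poisson distribution with parameter lam."""
    return exp(-lam) * lam ** n / factorial(n)


def joint(lam, p, x, y):
    """P(X = x, Y = y) = P(X = x | N = x + y) P(N = x + y)."""
    n = x + y
    return comb(n, x) * p ** x * (1 - p) ** y * poisson_pmf(lam, n)


def marginal_heads(lam, p, x, n_max=150):
    """P(X = x): sum of P(X = x | N = n) P(N = n) over n = x, ..., n_max."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) * poisson_pmf(lam, n)
               for n in range(x, n_max + 1))


def marginal_tails(lam, p, y, n_max=150):
    """P(Y = y): the same sum with the roles of heads and tails exchanged."""
    return sum(comb(n, y) * (1 - p) ** y * p ** (n - y) * poisson_pmf(lam, n)
               for n in range(y, n_max + 1))


lam, p = 3.0, 0.3
for x in range(3):
    for y in range(3):
        product = marginal_heads(lam, p, x) * marginal_tails(lam, p, y)
        print(x, y, joint(lam, p, x, y), product)
```

The two printed columns agree up to rounding, and `marginal_heads(lam, p, x)` matches the Poisson mass with parameter $\lambda p$, as computed in the example.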

Exercise 2.2.4. Show that X and Y are independent if and only if for all $x, y$,

$$P(X \le x, Y \le y) = P(X \le x) P(Y \le y).$$

When the random variables $X_1, X_2, \ldots, X_n$ are independent, and we consider
functions of these random variables, these functions are also independent. For
instance, $X_1^2$ and $X_2 - 7$ are also independent. This is articulated in the following
theorem.

Theorem 2.2.5. Let $X_1, X_2, \ldots, X_n$ be independent random variables, and let,
for $i = 1, \ldots, n$, $g_i$ be a function $g_i : \mathbb{R} \to \mathbb{R}$. Then the random variables
$g_1(X_1), g_2(X_2), \ldots, g_n(X_n)$ are also independent.

Proof. We give the proof for the case n = 2, the full proof being asked for in the
next exercise. We write

$$
\begin{aligned}
P(g_1(X_1) = a_1, g_2(X_2) = a_2)
&= \sum_{(x_1, x_2) : g_i(x_i) = a_i} P(X_1 = x_1, X_2 = x_2) \\
&= \sum_{(x_1, x_2) : g_i(x_i) = a_i} P(X_1 = x_1) P(X_2 = x_2) \\
&= \sum_{x_1 : g_1(x_1) = a_1} P(X_1 = x_1) \times \sum_{x_2 : g_2(x_2) = a_2} P(X_2 = x_2) \\
&= P(g_1(X_1) = a_1) P(g_2(X_2) = a_2).
\end{aligned}
$$

□

Exercise 2.2.6. Give the full proof of the last result.

Exercise 2.2.7. Let $X_1$, $X_2$ and $X_3$ be independent random variables, and let
$g : \mathbb{R}^2 \to \mathbb{R}$ and $h : \mathbb{R} \to \mathbb{R}$. Show that $g(X_1, X_2)$ and $h(X_3)$ are independent
random variables. Generalise this result.



2.3 Expectation and Variance 

The concept of expectation refers to the ‘average value’ taken by a random variable. 
The expectation of a random variable will be defined as a sum, possibly with 
infinitely many terms, and it pays to spend a few lines about such sums.

Let $a_1, a_2, \ldots$ be real numbers. We only want to define $\sum_{n=1}^{\infty} a_n$ when this
sum does not change when we change the order of the $a_n$’s. A classical result from
calculus tells us that this is in particular the case when

1. all terms have the same sign, or

2. both $S^+ = \sum_{n : a_n > 0} a_n$ and $S^- = -\sum_{n : a_n < 0} a_n$ are finite.

If the numbers $a_1, a_2, \ldots$ satisfy the conditions in (1) or (2), then we say that the
sum $\sum_{n=1}^{\infty} a_n$ is well defined. Finally, we agree from now on that $\infty - \infty$ is not
defined.

Definition 2.3.1. The expectation of a random variable X is given by

$$E(X) = \sum_{x} x P(X = x),$$

whenever this sum is well defined.

Why is this definition reasonable? At the beginning of this section, we said 
that the expectation refers to the average value taken by a random variable. We 
will now explain why this is the case with the above definition. 

Let $x_1, \ldots, x_k$ be the outcomes of k independent random variables with the
same distribution as some random variable X, and let, for each m, $k_m$ be the
number of $x_i$’s which take the value m. Then we have that $\sum_{i=1}^{k} x_i = \sum_{m} m k_m$,
and therefore the average $\frac{1}{k} \sum_{i=1}^{k} x_i$ is equal to $\sum_{m} m \frac{k_m}{k}$. As already
mentioned in the previous chapter, the quotient $k_m/k$ should, for large k, be
close to $P(X = m)$, and the average of the $x_i$’s should therefore be close to
$\sum_{m} m P(X = m)$. This is precisely the definition of the expectation of X.

Example 2.3.2. Let X be a random variable with $P(X = 0) = 1/2$, and $P(X = 1) = 1/2$.
Then $E(X) = 1/2$. This shows that the expectation of a random variable
X need not be a value which can be taken by X. □

Example 2.3.3. Suppose that X takes values $-1$, 1 and 2, with equal probability,
and consider the random variable $Y = X^2$. Clearly, Y takes values 1 and 4 with
probability 2/3 and 1/3 respectively. Therefore $E(Y) = 1 \cdot \frac{2}{3} + 4 \cdot \frac{1}{3} = 2$. □

Here follows an example with an infinite expectation. 

Example 2.3.4. (St. Petersburg paradox) Suppose that you go to a casino, and
that you can play the following game. A random number X is chosen in the casino,
in such a way that $P(X = 2^n) = 2^{-n}$, for n = 1, 2, . . .. The player receives this
amount X from the casino. Of course, in order to play this game, you need to pay
beforehand. What would be the fair ‘entry fee’ for this game? In other words, how
much money would you be willing to pay in order to play this game? Perhaps you
want to base this amount on the expectation of X. The idea would be that the
expectation of X is the average amount of money that you receive, and it would be
only fair to pay exactly this amount in advance, making the game fair. However,
it is clear that

$$E(X) = \sum_{n=1}^{\infty} 2^n 2^{-n} = \infty,$$

and you would not be willing to pay an infinite amount of money to play this
game. □
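
To see the divergence concretely, one can add up the first m terms of the sum defining E(X). The Python sketch below is an illustration only; the function name `truncated_expectation` is invented here, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def truncated_expectation(m):
    """Sum of 2^n * P(X = 2^n) over n = 1, ..., m, where P(X = 2^n) = 2^(-n)."""
    return sum(Fraction(2) ** n * Fraction(1, 2 ** n) for n in range(1, m + 1))


for m in (1, 10, 100):
    print(m, truncated_expectation(m))
```

Every term contributes exactly 1, so the partial sum over the first m terms equals m and grows without bound.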

In Example 2.3.3, we computed the expectation of Y by first computing the 
probability mass function of Y. In this case, it was fairly easy, but in general this 
may not be so straightforward. Therefore, it would be nice if we could compute 
the expectation of a function of X by only using the probability mass function of 
X itself. This is indeed possible, and articulated in the following lemma. 

Lemma 2.3.5. Let X he a random variable, and g :R Then the expectation 

of g{X) is given by 

E{g{X))=J29{x)P{X = x), 

X 

whenever this last sum is well defined. 

Proof. Since the sum is well defined, we are free to change the order of summation.
We can now write

$$\begin{aligned}
E(g(X)) &= \sum_y y P(g(X) = y) = \sum_y y \sum_{x : g(x) = y} P(X = x) \\
&= \sum_y \sum_{x : g(x) = y} y P(X = x) = \sum_y \sum_{x : g(x) = y} g(x) P(X = x) \\
&= \sum_x g(x) P(X = x).
\end{aligned}$$

□

Example 2.3.6. For the random variable $Y$ in Example 2.3.3, this result leads to
the following computation: $E(Y) = \sum_x x^2 P(X = x) = 1 \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} + 4 \cdot \frac{1}{3} = 2$. □
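The two routes to $E(g(X))$ can be compared in a short sketch. The helper names `expectation_via_pmf_of_g` and `expectation_via_lemma` are hypothetical, introduced only for this illustration; the first builds the mass function of $Y = g(X)$ as in Example 2.3.3, the second applies Lemma 2.3.5 directly.

```python
from collections import defaultdict
from fractions import Fraction

# X takes the values -1, 1 and 2 with equal probability (Example 2.3.3).
pmf_x = {-1: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}


def g(x):
    return x * x


def expectation_via_pmf_of_g(pmf, func):
    """First compute the mass function of Y = func(X), then sum y * P(Y = y)."""
    pmf_y = defaultdict(Fraction)
    for x, prob in pmf.items():
        pmf_y[func(x)] += prob
    return sum(y * prob for y, prob in pmf_y.items())


def expectation_via_lemma(pmf, func):
    """Sum func(x) * P(X = x) over the values of X, as in Lemma 2.3.5."""
    return sum(func(x) * prob for x, prob in pmf.items())


print(expectation_via_pmf_of_g(pmf_x, g), expectation_via_lemma(pmf_x, g))
```

Both computations give the value 2 found in Examples 2.3.3 and 2.3.6.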

Exercise 2.3.7. Show that if $X$ takes only non-negative integer values, we have

$$E(X) = \sum_{n=0}^{\infty} P(X > n).$$

For two random variables $X$ and $Y$ on the same sample space, we can define new
random variables $X + Y$, $XY$, etcetera via

$$(X + Y)(\omega) = X(\omega) + Y(\omega), \quad \text{and} \quad (XY)(\omega) = X(\omega) Y(\omega).$$

Expectations turn out to behave very nicely under taking sums of random variables.
This will make it quite easy to compute the expectation of various random
variables, as we shall see in the next section.

Theorem 2.3.8. Let $X$ and $Y$ be two random variables defined on the same sample
space. If $E(X)$ and $E(Y)$ are not both infinite with opposite signs, then

$$E(X + Y) = E(X) + E(Y).$$

Proof. First,

$$E(Y) = \sum_j j P(Y = j) = \sum_j j \sum_i P(X = i, Y = j) = \sum_j \sum_i j P(X = i, Y = j) = \sum_i \sum_j j P(X = i, Y = j),$$

and therefore,

$$\begin{aligned}
E(X) + E(Y) &= \sum_i i P(X = i) + \sum_i \sum_j j P(X = i, Y = j) \\
&= \sum_i \sum_j i P(X = i, Y = j) + \sum_i \sum_j j P(X = i, Y = j) \\
&= \sum_i \sum_j (i + j) P(X = i, Y = j) \\
&= \sum_z z \sum_j P(X + Y = z, Y = j) \\
&= \sum_z z P(X + Y = z).
\end{aligned}$$

It follows that the last sum is well defined and hence $E(X + Y)$ exists and is equal
to $E(X) + E(Y)$. □

The conditions of the theorem are not necessary for the existence of $E(X + Y)$.
For instance, when the random variable $X$ satisfies $E(X) = \infty$, and $Y = -X$,
then $E(X) + E(Y)$ is not defined, but $E(X + Y) = 0$.

One of the most common misconceptions in probability theory is the idea that 
Theorem 2.3.8 should only be true when X and Y are independent. However, the 
preceding calculation shows that the result has nothing to do with independence. 
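To illustrate that the sum formula does not require independence, here is a minimal sketch with a strongly dependent pair: $X$ uniform on $\{-1, 1, 2\}$ and $Y = X^2$. The helper `expectation` and the variable names are our own, chosen for this illustration only.

```python
from fractions import Fraction

# Joint mass function of (X, Y) with Y = X^2; Y is completely determined by X.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 1, 2)}


def expectation(joint_pmf, func):
    """E(func(X, Y)) computed from the joint mass function."""
    return sum(func(x, y) * prob for (x, y), prob in joint_pmf.items())


e_x = expectation(joint, lambda x, y: x)
e_y = expectation(joint, lambda x, y: y)
e_sum = expectation(joint, lambda x, y: x + y)

print(e_x, e_y, e_sum)
```

Here $E(X) = 2/3$ and $E(Y) = 2$, and the expectation of the sum is again the sum of the expectations, even though $X$ and $Y$ are far from independent.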

Exercise 2.3.9. Extend Theorem 2.3.8 to more than two random variables.

Exercise 2.3.10. Show that $E(aX + b) = aE(X) + b$, for random variables $X$
and real numbers $a$ and $b$.

Instead of sums, we also need to consider products of random variables. It
turns out that for products, independence does play a crucial role.

Exercise 2.3.11. Find two random variables $X$ and $Y$ so that

$$E(XY) \neq E(X)E(Y).$$

Theorem 2.3.12. If the random variables $X$ and $Y$ are independent and $E(X)$ and
$E(Y)$ are finite, then $E(XY)$ is well defined and satisfies

$$E(XY) = E(X)E(Y).$$



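Proof. The term with $l = 0$ contributes nothing to $\sum_l l P(XY = l)$, and for $l \neq 0$
the event $\{XY = l\}$ requires $X = k$ and $Y = l/k$ for some $k \neq 0$. We therefore write

$$\begin{aligned}
\sum_l l P(XY = l) &= \sum_{l \neq 0} l \sum_{k \neq 0} P(X = k, Y = l/k) \\
&= \sum_{k \neq 0} \sum_{l \neq 0} k \frac{l}{k} P(X = k, Y = l/k) \\
&= \sum_{k \neq 0} \sum_{l \neq 0} k \frac{l}{k} P(X = k) P(Y = l/k) \\
&= \sum_{k \neq 0} k P(X = k) \sum_{l \neq 0} \frac{l}{k} P(Y = l/k) \\
&= E(X)E(Y).
\end{aligned}$$

Hence the sum in the first line is well defined, and is therefore equal to $E(XY)$. □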

It is not true that $E(XY) = E(X)E(Y)$ implies that $X$ and $Y$ are independent:

Exercise 2.3.13. Let $X$ and $Y$ be independent random variables with the same
distribution, taking values 0 and 1 with equal probability. Show that

$$E((X + Y)(|X - Y|)) = E(X + Y) E(|X - Y|),$$

but that $X + Y$ and $|X - Y|$ are not independent.

The expectation of a random variable X can be thought of as the average 
value taken by X. But clearly, this expectation does not say very much about the 
way the possible outcomes of X are spread out: 

Example 2.3.14. Consider a random variable $X$ with $P(X = 100) = 1/2$ and
$P(X = -100) = 1/2$, and a random variable $Y$ with $P(Y = 1) = 1/2$ and
$P(Y = -1) = 1/2$. Clearly both $E(X)$ and $E(Y)$ are equal to zero, but it is also
clear that outcomes of $X$ will always be very far away from this expectation, while
outcomes of $Y$ will be much closer. □



The variance of a random variable is intended to measure to what extent
the outcomes of a random variable are spread out. One way of doing this is to
look at the ‘average’ deviation from the expectation, that is, at the expectation of
$|X - E(X)|$. It turns out however, that it is more convenient to look at the square
$(X - E(X))^2$ instead.

Since it does not really make sense to talk about deviation from the expec- 
tation when this expectation is infinite, in the rest of this section, we assume that 
the expectations of the random variables involved are all finite. 

Definition 2.3.15. Let $X$ have finite expectation $\mu$. The variance $\mathrm{var}(X)$ of $X$ is
defined as

$$\mathrm{var}(X) = E((X - \mu)^2) = \sum_x (x - \mu)^2 P(X = x).$$

The standard deviation $\sigma(X)$ of $X$ is defined as the square root of the variance,

$$\sigma(X) = \sqrt{\mathrm{var}(X)}.$$

Lemma 2.3.16. $\mathrm{var}(X) = E(X^2) - (E(X))^2$.

Proof. We write $E(X) = \mu$. Then we have

$$\begin{aligned}
\mathrm{var}(X) &= E(X^2 - 2\mu X + \mu^2) \\
&= E(X^2) - 2\mu E(X) + \mu^2 \\
&= E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2.
\end{aligned}$$

□

Example 2.3.17. If $X$ takes the values 1 and 0 with probability $p$ and $1 - p$
respectively, then $E(X) = p$, $E(X^2) = p$ and therefore $\mathrm{var}(X) = p - p^2$, using the
last lemma. □
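The definition of the variance and the formula of Lemma 2.3.16 can be checked against each other in a few lines. This is only a sketch with hypothetical helper names (`mean`, `variance_by_definition`, `variance_by_lemma`), applied to the random variables of Examples 2.3.14 and 2.3.17; the value $p = 3/10$ is an arbitrary choice.

```python
from fractions import Fraction


def mean(pmf):
    return sum(x * prob for x, prob in pmf.items())


def variance_by_definition(pmf):
    """E((X - mu)^2), as in Definition 2.3.15."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * prob for x, prob in pmf.items())


def variance_by_lemma(pmf):
    """E(X^2) - (E(X))^2, as in Lemma 2.3.16."""
    return sum(x * x * prob for x, prob in pmf.items()) - mean(pmf) ** 2


half = Fraction(1, 2)
pmf_wide = {100: half, -100: half}    # X in Example 2.3.14
pmf_narrow = {1: half, -1: half}      # Y in Example 2.3.14
p = Fraction(3, 10)                   # arbitrary illustrative value
pmf_bernoulli = {1: p, 0: 1 - p}      # Example 2.3.17

for pmf in (pmf_wide, pmf_narrow, pmf_bernoulli):
    print(variance_by_definition(pmf), variance_by_lemma(pmf))
```

The two random variables of Example 2.3.14 share the expectation 0 but have variances 10000 and 1, which is the difference in spread that the expectation alone cannot show.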

Theorem 2.3.18. Let $X$ and $Y$ be random variables, and $a, b \in \mathbb{R}$.

(a) $\mathrm{var}(aX + b) = a^2 \mathrm{var}(X)$.

(b) $\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y) + 2(E(XY) - E(X)E(Y))$. In particular, if $X$
and $Y$ are independent, then

$$\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y).$$

Proof. (a) Let $E(X) = \mu$. We then write

$$\begin{aligned}
\mathrm{var}(aX + b) &= E((aX + b)^2) - (E(aX + b))^2 \\
&= E(a^2 X^2 + 2abX + b^2) - (a\mu + b)^2 \\
&= a^2 E(X^2) + 2ab\mu + b^2 - a^2\mu^2 - 2ab\mu - b^2 \\
&= a^2 E(X^2) - a^2 \mu^2 = a^2 \mathrm{var}(X).
\end{aligned}$$

(b)

$$\begin{aligned}
\mathrm{var}(X + Y) &= E((X + Y - E(X + Y))^2) \\
&= E((X - E(X))^2 + (Y - E(Y))^2 + 2(XY - E(X)E(Y))) \\
&= \mathrm{var}(X) + \mathrm{var}(Y) + 2(E(XY) - E(X)E(Y)).
\end{aligned}$$

□

The quantity $E(XY) - E(X)E(Y)$ which appears in (b) is called the covariance
of $X$ and $Y$, and denoted by $\mathrm{cov}(X, Y)$.

Exercise 2.3.19. Show that $\mathrm{cov}(X, X) = \mathrm{var}(X)$.

The expectation and variance of a random variable are frequently used ob- 
jects. There exists a number of useful inequalities which are based on them. 

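Theorem 2.3.20. Let $X$ be a random variable taking only non-negative values.
Then for any $a > 0$ we have

$$P(X \geq a) \leq \frac{1}{a} E(X).$$

Proof.

$$\begin{aligned}
E(X) &= \sum_x x P(X = x) \\
&\geq \sum_{x : x \geq a} x P(X = x) \geq a \sum_{x : x \geq a} P(X = x) \\
&= a P(X \geq a).
\end{aligned}$$

□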



Exercise 2.3.21. Can you point out exactly where we have used the assumption
that $X$ takes only non-negative values?

Applying this result to $|X|^k$ leads to

Corollary 2.3.22. (Markov’s inequality)

$$P(|X| \geq a) \leq \frac{1}{a^k} E(|X|^k).$$

Markov’s inequality for $k = 2$ and applied to $|X - E(X)|$ leads to

Corollary 2.3.23. (Chebyshev’s inequality)

$$P(|X - E(X)| \geq a) \leq \frac{1}{a^2} \mathrm{var}(X).$$

Exercise 2.3.24. Show how the inequalities of Markov and Chebyshev follow
from Theorem 2.3.20.
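A numerical check is no substitute for the proofs asked for above, but it shows how crude or sharp the bounds are in a concrete case. The following sketch uses a fair die as an assumed example (it does not appear in the text at this point), and the helper names are our own.

```python
from fractions import Fraction

# A fair die: an assumed example distribution for this illustration.
die = {face: Fraction(1, 6) for face in range(1, 7)}


def mean(pmf):
    return sum(x * prob for x, prob in pmf.items())


def variance(pmf):
    mu = mean(pmf)
    return sum((x - mu) ** 2 * prob for x, prob in pmf.items())


def probability(pmf, event):
    """P(event(X)) for a predicate on the values of X."""
    return sum((prob for x, prob in pmf.items() if event(x)), Fraction(0))


# Theorem 2.3.20 with a = 5: P(X >= 5) <= E(X) / 5.
a = 5
tail_lhs = probability(die, lambda x: x >= a)
tail_rhs = mean(die) / a

# Chebyshev with a = 2: P(|X - E(X)| >= 2) <= var(X) / 4.
b = 2
mu = mean(die)
chebyshev_lhs = probability(die, lambda x: abs(x - mu) >= b)
chebyshev_rhs = variance(die) / b ** 2

print(tail_lhs, tail_rhs, chebyshev_lhs, chebyshev_rhs)
```

In both cases the true probability is 1/3, while the bounds are 7/10 and 35/48: the inequalities hold, but they are far from tight here.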

Finally, we mention the famous Cauchy-Schwarz inequality.

Theorem 2.3.25. (Cauchy-Schwarz inequality) For any random variables $X$ and
$Y$ for which $E(XY)$ is defined, we have

$$E(XY) \leq \sqrt{E(X^2) E(Y^2)}.$$

Proof. The proof of this result is quite surprising. If $E(X^2) = \infty$ or $E(Y^2) = \infty$,
then there is nothing to prove. If they are both finite, then we first claim that
$E(XY)$ is well defined. To see this, observe that $|xy| \leq \frac{1}{2}(x^2 + y^2)$, for all real $x$
and $y$.

Exercise 2.3.26. Show that this last observation leads to the claim.

Also, if $E(X^2)$ or $E(Y^2)$ is zero, then $X$ or $Y$ takes the value zero with
probability 1, and there is nothing to prove. Hence we can assume that the right-hand
side is positive. Now let $a$ be a real number and define $Z = aX - Y$. Then

$$0 \leq E(Z^2) = a^2 E(X^2) - 2a E(XY) + E(Y^2).$$

The right-hand side can be seen as a quadratic expression in the variable $a$. Since this
quadratic expression is apparently non-negative, it follows that the corresponding
discriminant is non-positive. That is, we have

$$(2E(XY))^2 - 4E(X^2)E(Y^2) \leq 0,$$

which is what we wanted to prove. □

Exercise 2.3.27. Show that we have equality in the Cauchy-Schwarz inequality
if and only if $P(aX = Y) = 1$ for some $a$, or $X = 0$ with probability 1.

Here follows a number of examples of how to compute expectations and 
variances. 

Example 2.3.28. (Binomial distribution) Recall that a random variable $X$ is said
to have a binomial distribution with parameters $n \in \mathbb{N}$ and $p \in [0, 1]$ if

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k},$$

for $k = 0, 1, \ldots, n$. The random variable $X$ represents the number of heads when
we flip a coin $n$ times, where each flip gives heads with probability $p$. In particular,
we can write such a random variable $X$ as a sum

$$X = \sum_{i=1}^{n} Y_i,$$

where $Y_i = 1$ if the $i$th flip yields a head, and $Y_i = 0$ otherwise. It is clear that
$E(Y_i) = p$, and according to the sum formula for expectations, we find that

$$E(X) = np.$$

□

Exercise 2.3.29. Use the sum formula for variances to show that $\mathrm{var}(X) = np(1 - p)$.

Here we see the convenience of the sum formulas. Computing the expectation
of $X$ directly from its probability mass function is possible but tedious work. The
sum formula makes this much easier.
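For a concrete choice of parameters, the tedious direct computation can be left to a machine and compared with $np$. The sketch below assumes $n = 10$ and $p = 1/4$ purely for illustration; `binomial_pmf` is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """Mass function of a binomial distribution as a dict k -> P(X = k)."""
    return {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}


n, p = 10, Fraction(1, 4)  # illustrative parameters
pmf = binomial_pmf(n, p)

total_mass = sum(pmf.values())
direct_mean = sum(k * prob for k, prob in pmf.items())

print(total_mass, direct_mean, n * p)
```

The direct sum over the mass function returns exactly $np = 5/2$, in agreement with the sum formula.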



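Example 2.3.30. (Poisson distribution) Recall that a random variable $X$ is said to
have a Poisson distribution with parameter $\lambda > 0$ if

$$P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda},$$

for $k = 0, 1, \ldots$ To compute its expectation, we write

$$E(X) = \sum_{k=0}^{\infty} k \frac{\lambda^k}{k!} e^{-\lambda} = \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda.$$

□

Exercise 2.3.31. Show that the variance of $X$ is also equal to $\lambda$.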

Example 2.3.32. (Geometric distribution) Recall that a random variable $X$ has a
geometric distribution with parameter $p \in [0, 1]$ if

$$P(X = k) = p(1 - p)^{k-1},$$

for $k = 1, 2, \ldots$ To compute its expectation, we write

$$E(X) = p \sum_{k=1}^{\infty} k(1 - p)^{k-1}.$$

Let us denote $\sum_{k=1}^{n} k(1 - p)^{k-1}$ by $S_n$, and $\sum_{k=1}^{\infty} k(1 - p)^{k-1}$ by $S$. The following
trick is quite standard. We can write $(1 - p)S_n$ as

$$(1 - p)S_n = (1 - p) + 2(1 - p)^2 + 3(1 - p)^3 + \cdots + n(1 - p)^n.$$

Hence,

$$pS_n = S_n - (1 - p)S_n = 1 + (1 - p) + (1 - p)^2 + \cdots + (1 - p)^{n-1} - n(1 - p)^n, \qquad (2.1)$$

and we find that

$$S_n = \frac{1 + (1 - p) + (1 - p)^2 + \cdots + (1 - p)^{n-1}}{p} - \frac{n(1 - p)^n}{p}.$$

Now recall the geometric series $\sum_{k=0}^{\infty} a^k = \frac{1}{1 - a}$ when $|a| < 1$. It follows from this
that $S_n$ tends to $p^{-2}$ as $n \to \infty$, and therefore $S = p^{-2}$ and

$$E(X) = pS = \frac{1}{p}.$$

□

Exercise 2.3.33. Do you understand why we first had to look at $S_n$ and then
take the limit for $n \to \infty$? The problem has to do with the fact that you cannot
a priori assume that $S$ is finite. Why would this cause trouble in (2.1)?
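The finite identity (2.1) and the limit of $pS_n$ can be checked numerically. The sketch below fixes $p = 1/3$ as an arbitrary illustrative value; `partial_sum` and `right_hand_side` are names we introduce here, not notation from the text.

```python
from fractions import Fraction


def partial_sum(n, p):
    """S_n: the sum of k (1 - p)^(k - 1) over k = 1, ..., n."""
    q = 1 - p
    return sum(k * q ** (k - 1) for k in range(1, n + 1))


def right_hand_side(n, p):
    """1 + (1 - p) + ... + (1 - p)^(n - 1) - n (1 - p)^n, as in (2.1)."""
    q = 1 - p
    return sum(q ** j for j in range(n)) - n * q ** n


p = Fraction(1, 3)  # illustrative parameter

# Identity (2.1) holds exactly for every finite n.
identity_holds = all(p * partial_sum(n, p) == right_hand_side(n, p) for n in range(1, 21))

# For large n, p * S_n is numerically indistinguishable from 1 / p.
approx_mean = float(p * partial_sum(200, p))

print(identity_holds, approx_mean)
```

Working with the finite sums $S_n$ keeps every quantity finite, so the subtraction in (2.1) is always legitimate; only afterwards is the limit taken.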

Example 2.3.34. (A distribution without an expectation) Let $X$ have probability
mass function

$$P(X = k) = \frac{C}{k^2},$$

for all $k \in \mathbb{Z}$, $k \neq 0$, and where $C$ is chosen such that the sum equals 1. This
random variable does not have an expectation, since the sum

$$\sum_{k = -\infty, \, k \neq 0}^{+\infty} \frac{C}{k}$$

is not well defined. □



2.4 Random Vectors 

Apart from the discussion of independence, we have so far studied individual 
random variables. Very often however, the interplay between different random 
variables (defined on the same sample space) is very important, and not only in 
situations with independence. 

Example 2.4.1. Suppose that I ask you to throw a coin twice. The outcome of this
experiment is a vector, which we may denote by $(X, Y)$, where $X$ represents the
first throw, and $Y$ the second. Now suppose that there are two types of students:
diligent and lazy. The diligent student really throws the coin twice, and his or her
outcome can be any of the four possibilities (writing 0 for heads and 1 for tails)
(0, 0), (0, 1), (1, 0) and (1, 1). Clearly, we want the four outcomes to be equally
likely. The lazy student however, throws only once and copies the outcome for the
second coordinate of his outcome. His (or her) possible outcomes are (0, 0) and
(1, 1), again with the same probability. Suppose now that we only look at the first
coordinate $X$. In both cases, the probability that $X = 0$ is equal to 1/2. Also,
when we look at the second coordinate, the probability that $Y = 0$ is the same
for the diligent and lazy student. Hence, by looking at the individual outcomes $X$
and $Y$, we cannot distinguish between the two types of student. The only way
to distinguish the two students is to look at the outcome of the complete vector
$(X, Y)$. □

Definition 2.4.2. A random vector $(X_1, \ldots, X_d)$ is a mapping from a sample space
$\Omega$ into $\mathbb{R}^d$.

Definition 2.4.3. The joint probability mass function of a random vector $X =
(X_1, X_2, \ldots, X_d)$ is defined as

$$p_X(x_1, x_2, \ldots, x_d) = P(X_1 = x_1, \ldots, X_d = x_d).$$

The distribution of any of the individual $X_i$ is referred to as a marginal distribution,
or just a marginal.

Definition 2.4.4. The joint distribution function of $X = (X_1, \ldots, X_d)$ is the function
$F_X : \mathbb{R}^d \to [0, 1]$ given by

$$F_X(x_1, \ldots, x_d) = P(X_1 \leq x_1, \ldots, X_d \leq x_d).$$

In Example 2.4.1 it became clear that it is possible to have two random
vectors $(X, Y)$ and $(V, W)$ so that $X$ and $V$ have the same marginal distribution,
$Y$ and $W$ also have the same marginal distribution, but nevertheless the joint
distributions are different. Hence we cannot in general find the joint distributions
if we only know the marginals.

The next result shows that the opposite direction is possible: if we know the 
joint distribution, then we also know the marginal distributions. This reinforces 
the idea that the joint distribution really says more than the collection of marginal 
distributions. 

Theorem 2.4.5. Let $(X_1, X_2, \ldots, X_d)$ have probability mass function $p(x_1, \ldots, x_d)$.
Then the mass function of $X_1$ can be written as

$$p_{X_1}(x_1) = \sum_{x_2, x_3, \ldots, x_d} p(x_1, x_2, \ldots, x_d),$$

and similarly for the other marginals. In words, we find the mass function of $X_1$
by summing over all the other variables.

Proof. This is an immediate consequence of Theorem 1.4.7, where we take $A$ to
be the event that $X_1 = x_1$ and the $B_i$’s all possible outcomes of the remaining
coordinates. □

Exercise 2.4.6. Provide the details of the last proof. 

Example 2.4.7. Let $(X, Y)$ have joint probability mass function $p$ given by $p(0, 0) =
0.4$, $p(0, 1) = 0.2$, $p(1, 0) = 0.1$ and $p(1, 1) = 0.3$. Then

$$P(X = 0) = p(0, 0) + p(0, 1) = 0.6$$

and

$$P(X = 1) = p(1, 0) + p(1, 1) = 0.4.$$
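Summing a joint mass function over the other coordinate, as in Theorem 2.4.5, is easy to mechanise. The following sketch stores the joint mass function of Example 2.4.7 in a dictionary; the helper name `marginal` and the use of exact fractions are our own choices.

```python
from collections import defaultdict
from fractions import Fraction

# Joint mass function of (X, Y) from Example 2.4.7, in tenths.
joint = {
    (0, 0): Fraction(4, 10),
    (0, 1): Fraction(2, 10),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(3, 10),
}


def marginal(joint_pmf, coordinate):
    """Mass function of one coordinate, obtained by summing over the other."""
    result = defaultdict(Fraction)
    for outcome, prob in joint_pmf.items():
        result[outcome[coordinate]] += prob
    return dict(result)


print(marginal(joint, 0))  # marginal of X
print(marginal(joint, 1))  # marginal of Y
```

The marginal of $X$ puts mass 0.6 and 0.4 on the values 0 and 1, and the marginal of $Y$ puts mass 0.5 on each value.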

We can read off all information about the vector $(X, Y)$ from the following table.
The last column contains the marginal probability mass function of $X$, obtained
by summing all probabilities row-wise. The last row contains the probability mass
function of $Y$, obtained by summing the probabilities column-wise.

            0      1      X
     0     0.4    0.2    0.6
     1     0.1    0.3    0.4
     Y     0.5    0.5     1

It turns out that the joint distribution comes in handy when deciding whether
or not two or more random variables are independent. Indeed, the very definition
of independence implies that $X$ and $Y$ are independent if and only if

$$p_{(X,Y)}(x, y) = p_X(x) p_Y(y),$$

for all $x$ and $y$. The following result makes it sometimes even easier to decide
whether or not $X$ and $Y$ are independent. The point is that the functions $g$ and
$h$ below need not be the marginal mass functions of $X$ and $Y$ respectively.

Theorem 2.4.8. The random variables $X$ and $Y$ are independent if and only if
$p_{(X,Y)}$ can be factorised as a function of $x$ and a function of $y$, that is, can be
written as $p_{(X,Y)}(x, y) = g(x)h(y)$ for all $(x, y)$.

Proof. If $X$ and $Y$ are independent, then clearly we can take $g$ and $h$ as the
marginal mass functions of $X$ and $Y$. To prove the converse, we write $p$ for the
mass function of $(X, Y)$. Suppose that

$$p(x, y) = g(x)h(y)$$

for all $x$ and $y$. Summing over $y$ and using Theorem 2.4.5 leads to

$$p_X(x) = \sum_y p(x, y) = g(x) \sum_y h(y).$$

Since $\sum_x p_X(x) = 1$ this leads to $\sum_x g(x) \sum_y h(y) = 1$. Similarly, we find

$$p_Y(y) = h(y) \sum_x g(x).$$

Hence we obtain

$$p_X(x) p_Y(y) = g(x) h(y) \sum_y h(y) \sum_x g(x) = g(x)h(y) = p(x, y),$$

which is what we wanted to prove. □

Note that it follows from this proof that if $p(x, y) = g(x)h(y)$, then $g(x)$
differs from $p_X(x)$ by only a multiplicative constant, and similarly for $h(y)$ and
$p_Y(y)$. So although $g(x)$ is not necessarily $p_X(x)$, the difference is only a matter
of normalisation, and similarly for $h(y)$ and $p_Y(y)$.
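The criterion $p_{(X,Y)}(x, y) = p_X(x) p_Y(y)$ can be tested directly on the two students of Example 2.4.1. This is a sketch under our own naming (`marginals`, `is_independent`); it shows two joint mass functions with identical marginals, only one of which is the product of its marginals.

```python
from fractions import Fraction


def marginals(joint_pmf):
    """Return the mass functions of X and Y as two dictionaries."""
    p_x, p_y = {}, {}
    for (x, y), prob in joint_pmf.items():
        p_x[x] = p_x.get(x, Fraction(0)) + prob
        p_y[y] = p_y.get(y, Fraction(0)) + prob
    return p_x, p_y


def is_independent(joint_pmf):
    """Check p(x, y) == p_X(x) * p_Y(y) for all pairs of possible values."""
    p_x, p_y = marginals(joint_pmf)
    return all(
        joint_pmf.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
        for x in p_x
        for y in p_y
    )


diligent = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}
lazy = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(marginals(diligent) == marginals(lazy))
print(is_independent(diligent), is_independent(lazy))
```

The marginals coincide, yet only the diligent student's coordinates are independent, which is exactly why the joint distribution carries more information than the marginals.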

Example 2.4.9. Let $(X, Y)$ have joint mass function

$$p(k, l) = \frac{\lambda^k \mu^l}{k! \, l!} e^{-(\lambda + \mu)},$$

for $k, l = 0, 1, 2, \ldots$, and where $\lambda, \mu > 0$. It is obvious that $p(k, l)$ factorises as
the product of a function of $k$ and a function of $l$, and therefore $X$ and $Y$ are
independent. To compute the marginal distribution of $X$ we compute

$$P(X = k) = \sum_{l=0}^{\infty} p(k, l) = e^{-(\lambda + \mu)} \frac{\lambda^k}{k!} \sum_{l=0}^{\infty} \frac{\mu^l}{l!} = e^{-\lambda} \frac{\lambda^k}{k!},$$

which we recognise as the probability mass function of a Poisson distribution with
parameter $\lambda$. Similarly, $Y$ has a Poisson distribution with parameter $\mu$. □

Example 2.4.10. Let $(X, Y)$ have joint mass function

$$p(k, n) = \frac{C \cdot 2^{-k}}{n}, \quad \text{for } k = 1, 2, \ldots \text{ and } n = 1, \ldots, k,$$

and suitable constant $C$. It seems that $p(k, n)$ can be factorised as $p(k, n) = C 2^{-k} \cdot \frac{1}{n}$,
which would imply that $X$ and $Y$ are independent. However, this is not true. There
is dependency between $X$ and $Y$, which you can see when you look at the values
taken by $k$ and $n$. In fact, to avoid the range conditions, we could write

$$p(k, n) = \frac{C 2^{-k}}{n} 1_{\{n \leq k\}}(k, n),$$

where $1_{\{n \leq k\}}(k, n)$ is an indicator function taking the value 1 if $n \leq k$ and 0
otherwise. Now it is clear that $p(k, n)$ cannot be factorised, and we see that $X$ and
$Y$ are not independent. □

We can also use joint mass functions to determine the distribution of a sum 
of two random variables. 

Theorem 2.4.11. Let $(X, Y)$ be a random vector on some sample space, with mass
function $p$, and let $Z = X + Y$. Then the mass function $p_Z$ of $Z$ is given by

$$p_Z(z) = \sum_x p(x, z - x) = \sum_y p(z - y, y).$$

In particular, when $X$ and $Y$ are independent we find

$$p_Z(z) = \sum_x p_X(x) p_Y(z - x) = \sum_y p_X(z - y) p_Y(y).$$

Exercise 2.4.12. Prove this theorem.

Example 2.4.13. Suppose $X$ and $Y$ are independent with joint mass function as
in Example 2.4.9. This means, as noted before, that $X$ and $Y$ have a Poisson
distribution with parameters $\lambda$ and $\mu$ respectively. How can we compute the
distribution of the sum $X + Y$? According to Theorem 2.4.11 and Exercise 1.5.7, we
find

$$\begin{aligned}
P(X + Y = z) &= \sum_{k=0}^{z} \frac{\lambda^k \mu^{z-k}}{k! (z - k)!} e^{-(\lambda + \mu)} \\
&= \frac{e^{-(\lambda + \mu)}}{z!} \sum_{k=0}^{z} \binom{z}{k} \lambda^k \mu^{z-k} \\
&= e^{-(\lambda + \mu)} \frac{(\lambda + \mu)^z}{z!}.
\end{aligned}$$

This means that $X + Y$ again has a Poisson distribution, whose parameter is the
sum of the original parameters. □
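The convolution formula of Theorem 2.4.11 can be evaluated numerically and compared with the Poisson mass function with parameter $\lambda + \mu$. The parameter values 1.5 and 2.5 below are arbitrary, and `poisson_pmf` and `pmf_of_sum` are hypothetical helper names.

```python
import math


def poisson_pmf(k, rate):
    """P(X = k) for a Poisson distribution with the given parameter."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def pmf_of_sum(z, lam, mu):
    """P(X + Y = z) for independent Poisson X and Y, via the convolution sum."""
    return sum(poisson_pmf(k, lam) * poisson_pmf(z - k, mu) for k in range(z + 1))


lam, mu = 1.5, 2.5  # illustrative parameters

for z in range(5):
    print(z, pmf_of_sum(z, lam, mu), poisson_pmf(z, lam + mu))
```

Up to floating-point rounding, the two columns agree for every $z$.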



We end this section with the analogue of Lemma 2.3.5 for vectors. 

Lemma 2.4.14. Let $(X_1, X_2, \ldots, X_d)$ be a random vector and $g : \mathbb{R}^d \to \mathbb{R}$. Then

$$E(g(X_1, \ldots, X_d)) = \sum_{x_1, \ldots, x_d} g(x_1, \ldots, x_d) P(X_1 = x_1, \ldots, X_d = x_d),$$

whenever this sum is well defined.

Exercise 2.4.15. Prove this lemma.



2.5 Conditional Distributions and Expectations 

One of the most important concepts in probability theory is that of conditioning. 
Very often, we want to know something about a certain random variable when 
some information about another is available. In principle, we have dealt already 
with conditional probability in Chapter 1, but now we will talk about this in terms 
of random variables. In the following, X and Y are random variables on the same 
sample space. 

Definition 2.5.1. The conditional probability mass function of $Y$ given $X = x$ is
defined as

$$p_{Y|X}(y|x) = P(Y = y | X = x),$$

whenever $P(X = x) > 0$. The conditional distribution function of a random variable
$Y$ given $X = x$ is defined as

$$F_{Y|X}(y|x) = P(Y \leq y | X = x),$$

whenever $P(X = x) > 0$.

It follows from this definition that

$$p_{Y|X}(y|x) = \frac{p_{(X,Y)}(x, y)}{p_X(x)}. \qquad (2.2)$$

Exercise 2.5.2. Derive formula (2.2).

When we learn that $X = x$, our conditional probability mass function for $Y$ is
$p_{Y|X}(y|x)$, which should be seen as a function of $y$. The corresponding expectation
is then given by $\sum_y y \, p_{Y|X}(y|x)$. Notice that this conditional expectation is a
function of $x$, the outcome of $X$. We denote this function of $x$ by $E(Y | X = x)$.

Definition 2.5.3. The conditional expectation of $Y$ given $X = x$ is denoted by
$E(Y | X = x)$, and defined as

$$E(Y | X = x) = \sum_y y \, p_{Y|X}(y|x).$$

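Example 2.5.4. Consider the vector $(X, Y)$ in Example 2.4.7. We have

$$p_{Y|X}(0|0) = P(Y = 0 | X = 0) = \frac{p(0, 0)}{P(X = 0)} = \frac{0.4}{0.6} = \frac{2}{3}.$$

Similarly, $p_{Y|X}(1|0) = \frac{1}{3}$, and this leads to

$$E(Y | X = 0) = 0 \cdot \frac{2}{3} + 1 \cdot \frac{1}{3} = \frac{1}{3}.$$

□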
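Formula (2.2) and Definition 2.5.3 translate directly into a short computation. The sketch below reproduces the numbers of Example 2.5.4; the function names `conditional_pmf_of_y` and `conditional_expectation_of_y` are our own and not notation from the text.

```python
from fractions import Fraction

# Joint mass function of (X, Y) from Example 2.4.7, in tenths.
joint = {
    (0, 0): Fraction(4, 10),
    (0, 1): Fraction(2, 10),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(3, 10),
}


def conditional_pmf_of_y(joint_pmf, x):
    """p_{Y|X}(y|x) = p(x, y) / p_X(x), defined only when P(X = x) > 0."""
    p_x = sum(prob for (a, _), prob in joint_pmf.items() if a == x)
    if p_x == 0:
        raise ValueError("conditioning on an event of probability zero")
    return {b: prob / p_x for (a, b), prob in joint_pmf.items() if a == x}


def conditional_expectation_of_y(joint_pmf, x):
    """E(Y | X = x): the sum of y * p_{Y|X}(y|x) over y."""
    return sum(y * prob for y, prob in conditional_pmf_of_y(joint_pmf, x).items())


print(conditional_pmf_of_y(joint, 0))
print(conditional_expectation_of_y(joint, 0))
```

For $x = 0$ this gives the conditional masses 2/3 and 1/3 and the conditional expectation 1/3, as computed by hand above.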

Exercise 2.5.5. Compute $E(Y | X = 1)$, $E(X | Y = 0)$ and $E(X | Y = 1)$ in this
example.

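Example 2.5.6. Consider, as in Example 2.4.13, two independent random variables
$X$ and $Y$, with Poisson distributions with parameters $\lambda$ and $\mu$ respectively. We
are interested in the conditional distribution of $X$, given $X + Y$. That is, we want
to compute $P(X = k | X + Y = m + k)$. Keeping in mind that $X + Y$ has again a
Poisson distribution with parameter $\lambda + \mu$, we find

$$\begin{aligned}
P(X = k | X + Y = m + k) &= \frac{P(X = k, Y = m)}{P(X + Y = m + k)} \\
&= \frac{\frac{\lambda^k}{k!} e^{-\lambda} \, \frac{\mu^m}{m!} e^{-\mu}}{\frac{(\lambda + \mu)^{m+k}}{(m+k)!} e^{-(\lambda + \mu)}} \\
&= \frac{(m + k)!}{k! \, m!} \frac{\lambda^k \mu^m}{(\lambda + \mu)^{m+k}} \\
&= \binom{m + k}{k} \left( \frac{\lambda}{\lambda + \mu} \right)^k \left( 1 - \frac{\lambda}{\lambda + \mu} \right)^m.
\end{aligned}$$

This is to say that (take $m + k = r$)

$$p_{X|X+Y}(k|r) = \binom{r}{k} \left( \frac{\lambda}{\lambda + \mu} \right)^k \left( 1 - \frac{\lambda}{\lambda + \mu} \right)^{r-k},$$

so that the conditional distribution of $X$ given $X + Y = r$ is a binomial distribution
with parameters $r$ and $\lambda/(\lambda + \mu)$. Hence

$$E(X | X + Y = r) = \frac{r\lambda}{\lambda + \mu}.$$

□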

Conditional expectations are extremely useful for computing unconditional 
expectations. This is articulated in the following theorem. 

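Theorem 2.5.7. For random variables $X$ and $Y$ defined on the same sample space,
we have

$$E(Y) = \sum_x E(Y | X = x) P(X = x).$$

Proof.

$$\begin{aligned}
E(Y) &= \sum_y y P(Y = y) \\
&= \sum_y y \sum_x P(X = x, Y = y) \\
&= \sum_y y \sum_x p_X(x) p_{Y|X}(y|x) \\
&= \sum_x p_X(x) \sum_y y \, p_{Y|X}(y|x) \\
&= \sum_x P(X = x) E(Y | X = x).
\end{aligned}$$

□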

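Example 2.5.8. A chicken produces $N$ eggs, where $N$ is a random variable with a
Poisson distribution with parameter $\lambda$. Each of the eggs hatches with probability
$p$, independently of the other eggs. Let $K$ be the number of chicks. We want to
find $E(K | N = n)$ and $E(K)$. To do this, note that the assumptions tell us that

$$p_N(n) = \frac{\lambda^n}{n!} e^{-\lambda}, \qquad p_{K|N}(k|n) = \binom{n}{k} p^k (1 - p)^{n-k}.$$

This gives us that $E(K | N = n) = \sum_k k \, p_{K|N}(k|n) = pn$, and hence

$$\begin{aligned}
E(K) &= \sum_n E(K | N = n) P(N = n) \\
&= \sum_n pn P(N = n) = p \sum_n n P(N = n) \\
&= pE(N) = p\lambda.
\end{aligned}$$

It is a little more work to compute $E(N | K = k)$. To do this, we assume that $n \geq k$
(why?) and write

$$\begin{aligned}
P(N = n | K = k) &= \frac{P(N = n, K = k)}{P(K = k)} \\
&= \frac{P(K = k | N = n) P(N = n)}{\sum_{m=k}^{\infty} P(K = k | N = m) P(N = m)} \\
&= \frac{\binom{n}{k} p^k (1 - p)^{n-k} \frac{\lambda^n}{n!} e^{-\lambda}}{\sum_{m=k}^{\infty} \binom{m}{k} p^k (1 - p)^{m-k} \frac{\lambda^m}{m!} e^{-\lambda}} \\
&= \frac{(\lambda(1 - p))^{n-k}}{(n - k)!} e^{-\lambda(1 - p)}.
\end{aligned}$$

This we recognise as the probability mass function of a Poisson random variable
with parameter $\lambda(1 - p)$, evaluated at $n - k$, and we now see that

$$\begin{aligned}
E(N | K = k) &= \sum_{n=k}^{\infty} n P(N = n | K = k) \\
&= \sum_{n=k}^{\infty} (k + n - k) \frac{(\lambda(1 - p))^{n-k}}{(n - k)!} e^{-\lambda(1 - p)} \\
&= k + \lambda(1 - p),
\end{aligned}$$

where we use the fact that a random variable with a Poisson distribution with
parameter $\lambda(1 - p)$ has expectation $\lambda(1 - p)$. □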
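The two results of the chicken-and-egg example, $E(K) = p\lambda$ and $E(N | K = k) = k + \lambda(1 - p)$, can be checked numerically by truncating the infinite sums. Everything in this sketch is an assumption made for illustration: the values $\lambda = 4$, $p = 0.3$, $k = 2$, the truncation level `N_MAX`, and all function names.

```python
import math

LAM, P = 4.0, 0.3   # illustrative parameters
N_MAX = 80          # truncation level; the neglected Poisson tail is negligible


def poisson_pmf(n, rate):
    return math.exp(-rate) * rate ** n / math.factorial(n)


def joint_pmf(n, k):
    """P(N = n, K = k) = P(K = k | N = n) P(N = n)."""
    return math.comb(n, k) * P ** k * (1 - P) ** (n - k) * poisson_pmf(n, LAM)


# E(K), summing k * P(N = n, K = k) over all pairs with k <= n.
expected_chicks = sum(k * joint_pmf(n, k) for n in range(N_MAX) for k in range(n + 1))


def conditional_pmf_of_n(n, k):
    """P(N = n | K = k), with the denominator P(K = k) summed up to N_MAX."""
    p_k = sum(joint_pmf(m, k) for m in range(k, N_MAX))
    return joint_pmf(n, k) / p_k


k = 2
expected_eggs_given_k = sum(n * conditional_pmf_of_n(n, k) for n in range(k, N_MAX))

print(expected_chicks, P * LAM)
print(expected_eggs_given_k, k + LAM * (1 - P))
```

The truncated sums agree with $p\lambda = 1.2$ and $k + \lambda(1 - p) = 4.8$ up to rounding, and the conditional mass function matches a Poisson mass function shifted by $k$.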



Exercise 2.5.9. Explain why this last result agrees with our probabilistic intu- 
ition. 



The following amusing and important example shows that we must be extremely
careful when we deal with conditional expectations and distributions.

Example 2.5.10. (First envelope problem) Suppose that I give you two envelopes,
both containing a certain amount of money. I do not tell you the exact amounts,
but I do tell you that one envelope contains twice the amount of the other. For
convenience, we also assume that the amounts are powers of 2, that is, of the form
$2^n$ for some $n \in \mathbb{Z}$. Now you have to choose an envelope, but you are not allowed
to open it. After choosing this envelope I will give you the chance to swap. Does
it make sense for you to do that? Clearly, common sense tells us that swapping
couldn’t make the slightest difference, but consider the following reasoning:

The amount of money in the chosen envelope is denoted by $X$, the amount
in the other is denoted by $Y$. That is, we consider a random vector $(X, Y)$. Now
suppose that $X = x$. What is the conditional expectation of $Y$ given this information?
That is, we want to compute $E(Y | X = x)$. When $X = x$, $Y$ can take
only two values, namely $2x$ and $x/2$. Since you have chosen a random envelope,
you have probability 1/2 to have the envelope with the largest amount of money.
That is, we have

$$P(Y = 2x | X = x) = P(Y = \tfrac{x}{2} | X = x) = \tfrac{1}{2}.$$

This leads to

$$E(Y | X = x) = \tfrac{1}{2} \cdot 2x + \tfrac{1}{2} \cdot \tfrac{x}{2} = \tfrac{5}{4} x.$$

This is very strange, since this conditional expectation of $Y$ given $X = x$ is now
strictly larger than $x$, the amount of your chosen envelope?! This suggests that
you should swap, since the expected amount of the other envelope is larger than
what you have in your hands.

There must be something wrong, and we will try to explain now what the
trouble is. Consider the random vector $(X, Y)$. The procedure of choosing the
envelopes makes some implicit assumptions about the distribution of this random
vector. In the first place, the marginal distributions of $X$ and $Y$ must be the same
(why?). Also, as articulated above,

$$\begin{aligned}
P(Y = x) &= P(Y = x | X = 2x) P(X = 2x) + P(Y = x | X = \tfrac{x}{2}) P(X = \tfrac{x}{2}) \\
&= \tfrac{1}{2} P(X = 2x) + \tfrac{1}{2} P(X = \tfrac{x}{2}).
\end{aligned}$$

Taking the last two observations together now gives

$$P(X = x) = \tfrac{1}{2} P(X = 2x) + \tfrac{1}{2} P(X = \tfrac{x}{2}).$$

Now recall that we have assumed that $X$ takes only values of the form $2^n$. Denote
$P(X = 2^n)$ by $q_n$. The last display tells us that for all $n \in \mathbb{Z}$, we have

$$q_n = \frac{q_{n-1} + q_{n+1}}{2}.$$

A moment of thought reveals that this implies that all points $(n, q_n)$, plotted in
the plane, must lie on a straight line. If the slope of this straight line is not zero,
then there will be $q_n$’s with $q_n < 0$, which is clearly impossible, since $q_n$ represents
a probability. On the other hand, if the line has slope zero, then $q_n = c$ for some
$c$. If $c = 0$, then $\sum_n q_n = 0$ which is impossible since they should sum up to 1,
being the probability mass function of $X$. If $c > 0$, then $\sum_n q_n = \infty$, which is
also not allowed, for the same reason. We conclude that a random vector with the
properties implicitly assumed, does not exist.

This is a most remarkable conclusion. The point is that an experiment as
described cannot be performed. Apparently, it is impossible to distribute amounts
of money over the envelopes such that no matter what I see in the first, the amounts
in the other are half or twice this amount, with equal probabilities. □



2.6 Generating Functions 

In this section we have a special look at random variables that take values in N. We 
shall introduce a new concept, the generating function of such a random variable. 
There are at least two good reasons for doing so. 

In the first place, generating functions are a very convenient tool for all sorts 
of computations, that would be difficult and tedious without them. These com- 
putations have to do with sums of random variables, expectations and variances. 
For an application of this, see Section 6.5. 

In the second place, they are a very good warm-up for the concept of characteristic
functions which are similar in nature but not restricted to integer-valued
random variables. In this section, $X, Y, \ldots$ are random variables taking values in
$\mathbb{N}$.

Definition 2.6.1. The generating function of $X$ is defined as

$$G_X(s) = E(s^X) = \sum_{n=0}^{\infty} p_X(n) s^n,$$

for all $s \in \mathbb{R}$ for which this sum converges.

Exercise 2.6.2. Show that $G_X(s)$ converges for at least all $s \in [0, 1]$.



Before we continue, here are a few examples.

Example 2.6.3. If $P(X = c) = 1$, then $G_X(s) = s^c$. □

Example 2.6.4. (Geometric distribution) If $X$ has a geometric distribution with
parameter $p$, then (recall again the geometric series)

$$\begin{aligned}
G_X(s) &= \sum_{k=1}^{\infty} p(1 - p)^{k-1} s^k \\
&= \frac{p}{1 - p} \sum_{k=1}^{\infty} ((1 - p)s)^k \\
&= \frac{p}{1 - p} \cdot \frac{s(1 - p)}{1 - s(1 - p)} = \frac{sp}{1 - s + sp}.
\end{aligned}$$

□
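The closed form in Example 2.6.4 can be compared with a long partial sum of the defining series. In this sketch the parameter $p = 0.3$, the number of terms, and the function names are all assumptions made for illustration.

```python
import math


def geometric_gf_partial(s, p, terms=2000):
    """Partial sum of p (1 - p)^(k - 1) s^k over k = 1, ..., terms."""
    return sum(p * (1 - p) ** (k - 1) * s ** k for k in range(1, terms + 1))


def geometric_gf_closed(s, p):
    """The closed form sp / (1 - s + sp) from Example 2.6.4."""
    return s * p / (1 - s + s * p)


p = 0.3  # illustrative parameter

for s in (0.0, 0.25, 0.5, 1.0):
    print(s, geometric_gf_partial(s, p), geometric_gf_closed(s, p))
```

At $s = 1$ both expressions equal 1, as they must for any generating function of a random variable.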

Exercise 2.6.5. Show that the generating function of a random variable $X$ with
a Poisson distribution with parameter $\lambda$, is given by

$$G_X(s) = e^{\lambda(s - 1)}.$$

Working with generating functions requires some knowledge of infinite power 
series. We do not aim to treat the general theory of such infinite series, and we 
will state all we need in one theorem, the proof of which can be found in most 
textbooks on analysis. 

Theorem 2.6.6. Let $G_a(s) = \sum_{n=0}^{\infty} a_n s^n$ be a power series, where $a = (a_0, a_1, a_2, \ldots)$
is a sequence of non-negative real numbers. Then

(a) There exists a radius of convergence $R \geq 0$, such that the series converges for
$|s| < R$, and diverges for $|s| > R$.

(b) We can differentiate $G_a(s)$ term by term for all $|s| < R$.

(c) If $G_b(s) = \sum_{n=0}^{\infty} b_n s^n$ is another power series with

$$G_a(s) = G_b(s),$$

for all $|s| < R'$ for some $0 < R' \leq R$, then $a_n = b_n$ for all $n$.

(d) (Abel’s theorem) If $R \geq 1$, then

$$\lim_{s \uparrow 1} G_a(s) = \sum_{n=0}^{\infty} a_n.$$

We will first deal with the calculation of expectation and variance. Since our
random variables in this section are non-negative, we have no problem with the
existence of the expectation. When we talk about the $k$th derivative at 1, $G^{(k)}(1)$,
we mean $\lim_{s \uparrow 1} G^{(k)}(s)$.

Theorem 2.6.7. Let $X$ have generating function $G$. Then

(a) $E(X) = G'(1)$.

(b) $\mathrm{var}(X) = G''(1) + G'(1) - G'(1)^2$.

Proof. For (a) we take $s < 1$ and write, using Theorem 2.6.6(b) and (d),

$$G'(s) = \sum_{n=1}^{\infty} n s^{n-1} p_X(n) \to \sum_{n=1}^{\infty} n p_X(n) = E(X),$$

as $s \uparrow 1$. For (b) we do something more general first:

Exercise 2.6.8. Show, using the same idea as in the proof of (a), that for all
$k = 1, 2, \ldots$

$$E(X(X - 1) \cdots (X - k + 1)) = G^{(k)}(1).$$

The expression in (b) now follows, since

$$\begin{aligned}
\mathrm{var}(X) &= E(X^2) - (E(X))^2 \\
&= E(X(X - 1)) + E(X) - (E(X))^2 \\
&= G''(1) + G'(1) - G'(1)^2.
\end{aligned}$$

□

Example 2.6.9. Let $X$ have a Poisson distribution with parameter $\lambda$. Then $G_X(s) =
e^{\lambda(s - 1)}$. Hence $G'(1) = \lambda$ and $G''(1) = \lambda^2$. It follows that $E(X) = \lambda$ and
$\mathrm{var}(X) = \lambda^2 + \lambda - \lambda^2 = \lambda$. □

Exercise 2.6.10. Find the expectation and variance of a geometric distribution
using generating functions.

Generating functions can also be very helpful in studying sums of random 
variables. 

Theorem 2.6.11. If $X$ and $Y$ are independent, then

$$G_{X+Y}(s) = G_X(s) G_Y(s).$$

Proof. Since $X$ and $Y$ are independent, so are $s^X$ and $s^Y$. Hence

$$G_{X+Y}(s) = E(s^{X+Y}) = E(s^X s^Y) = E(s^X) E(s^Y) = G_X(s) G_Y(s).$$

□

Exercise 2.6.12. Extend this result to any finite number of random variables.

Example 2.6.13. This result can be used to compute the generating function of the
binomial distribution. Indeed, if $X$ has a binomial distribution with parameters
$n$ and $p$, then $X$ has the same distribution as the sum of $n$ independent random
variables $X_1, \ldots, X_n$, all with the same distribution, and $P(X_i = 1) = p =
1 - P(X_i = 0)$. Clearly, $G_{X_i}(s) = (1 - p) + ps$, and hence,

$$G_X(s) = \prod_{i=1}^{n} G_{X_i}(s) = ((1 - p) + ps)^n.$$

□

Exercise 2.6.14. Use generating functions to prove that the sum of two independent
random variables with a Poisson distribution, has again a Poisson distribution.
Be very precise about what part of Theorem 2.6.6 you use in your argument.
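For random variables with finitely many values, a generating function is a polynomial, and the product rule of Theorem 2.6.11 becomes multiplication of coefficient lists. The sketch below uses this to rebuild the binomial mass function of Example 2.6.13 and then applies Theorem 2.6.7 to the coefficients. The parameters $n = 5$, $p = 1/3$ and the helper name `multiply` are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb


def multiply(a, b):
    """Coefficient list of the product of two polynomials in s."""
    result = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            result[i + j] += a_i * b_j
    return result


n, p = 5, Fraction(1, 3)   # illustrative parameters
coin = [1 - p, p]          # coefficients of G_{X_i}(s) = (1 - p) + p s

# G_X(s) is the n-fold product of the generating function of a single coin flip.
g = [Fraction(1)]
for _ in range(n):
    g = multiply(g, coin)

binomial = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

# Derivatives at s = 1, computed term by term from the coefficients.
first = sum(k * c for k, c in enumerate(g))              # G'(1)
second = sum(k * (k - 1) * c for k, c in enumerate(g))   # G''(1)
mean = first
variance = second + first - first ** 2

print(g == binomial, mean, variance)
```

The coefficients of the product coincide with the binomial mass function, and the derivative formulas return $np = 5/3$ and $np(1 - p) = 10/9$.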



2.7 Exercises 



Exercise 2.7.1. Suppose that we throw a fair coin three times. Let X be the number 
of heads. Find the probability mass function and distribution function of X. 



Exercise 2.7.2. Compute the expectation of X in the previous exercise. 

Exercise 2.7.3. Compute $P(X > 2)$ when $X$ has a Poisson distribution with
parameter 3.

Exercise 2.7.4. Compute $P(X > 10)$ when $X$ has a geometric distribution with
parameter $p$.

Exercise 2.7.5. Compute $P(X = 3)$ when $X$ has a negative binomial distribution
with parameters 2 and $p$.



Exercise 2.7.6. An urn contains 8 white, 4 black, and two red balls. We win 2 euro 
for each black ball that we draw, and lose 1 euro for each white ball that we draw. 
We choose three balls from the urn, and X denotes our winnings. Write down the 
probability mass function of X. 

Exercise 2.7.7. Suppose we throw a die twice. Let X be the product of the two 
outcomes. 

(a) Compute the probability mass function of X. 

(b) Compute $P(X = k | D)$, where $D$ is the event that the sum of the outcomes is
equal to 7.

Exercise 2.7.8. Let $X$ be a random variable with probability mass function given
by $p(-4) = p(4) =$ and $p(2) = p(-2) =$ Let $Y = aX + b$, for certain $a$ and
$b$.

(a) For what values of a and b does Y take values in {0, 1, 2, 3, 4}? 

(b) Compute the probability mass function of Y for these choices of a and b. 

Exercise 2.7.9. The NBA (National Basketball Association) draft lottery involves 
the 11 teams that had the worst won- lost records during the year. A total of 66 
balls are placed in an urn. Each of these balls is inscribed with the name of a team: 
11 have the name of the team with the worst record, 10 have the name of the team 
with the second worst record, and so on. A ball is then chosen at random and the 
team whose name is on the ball is given the first pick in the draft of players about 
to enter the league. Another ball is then chosen, and if it belongs to a different 
team, then the team to which it belongs receives the second draft pick. (If the 
ball belongs to the team receiving the first pick, then this ball is discarded and 







another one is chosen, and so on, until another team is chosen.) Finally, another 
ball is chosen and the team named on the ball (provided it is different from the 
previous two teams) receives the third pick. The remaining draft picks 4 through 
11 are then awarded to the 8 teams that did not win the lottery in inverse order 
of their won-lost record. 

Let X denote the draft pick of the team with the worst record. Find the 
probability mass function of X. 

Exercise 2.7.10. Let X be a random variable with probability mass function

\[ p_X(n) = c\,\frac{1}{n!}, \]

for $n = 0, 1, 2, \ldots$, and $p_X(n) = 0$ otherwise.

(a) Compute c. (Use the expansion $e^x = \sum_{n=0}^{\infty} x^n/n!$.)

(b) Compute the probability that X is even. 

(c) Compute the expectation of X. 

Exercise 2.7.11. Suppose we want to test a large number of blood samples in 
order to see if they contain a certain antibody. To reduce the amount of work, 
one proceeds as follows. We divide the samples into groups of size k, and these 
k samples are put together. The resulting mixtures are tested. If the test of such 
a mixture is negative, no further action is required. If it is positive, then the k 
original samples are individually tested after all, so that in such a case, a total of k + 1 
tests needs to be performed. The samples contain the antibody with probability 
p, independently of each other. 

(a) What is the probability that a mixture of k samples contains the antibody? 

(b) Let S be the total number of tests that needs to be performed when the original number of samples is n = mk. Compute E(S) and var(S).

(c) For what values of p does this method give an improvement, for suitable k, when we compare this to individual tests right from the beginning? Find the optimal value of k as a function of p.

Exercise 2.7.12. Suppose that X has a geometrical distribution with parameter p. 
Show that 

\[ P(X = m + k \mid X > m) = P(X = k). \]

Explain why this property is called the lack of memory property of the geometrical 
distribution. 

Exercise 2.7.13. Compute the expectation of X when X has a negative binomial 
distribution. 

Exercise 2.7.14. A gambling book recommends the following strategy for the game 
of roulette. It recommends that a gambler bet 1 euro on red. If red appears (with 
probability ^), then the gambler should take his or her profit and quit the game. 
If the gambler loses his or her bet (which has probability || of occurring), he 
should make additional 1 euro bets on red on each of the next two spins of the 
roulette wheel and then quit. Let X denote the gambler's winnings when he quits. 

(a) Find P(X > 0). 

(b) Is this a winning strategy? Explain your answer. 

(c) Compute E(X). 

Exercise 2.7.15. Suppose that X has a Poisson distribution with parameter $\lambda$. Show that

\[ P(X = k) = \frac{\lambda}{k}\, P(X = k-1), \]

for $k = 1, 2, \ldots$. Use this to find k so that P(X = k) is maximal.

Exercise 2.7.16. Suppose that the number of children N of a randomly chosen 
family satisfies 

for n = 0, 1, . . .. 

(a) What is the probability that a family has no children? 

(b) Compute the expectation of N. 

Now suppose that a child is equally likely to be a girl or a boy, and let X be the 
number of daughters in a randomly chosen family. 

(c) Compute E(X | N = n) and use this to compute E(X). 

Exercise 2.7.17. Let (X, Y) be a random vector with probability mass function $p_{(X,Y)}(i,j) = 1/10$, for $1 \le i \le j \le 4$.

(a) Show that this is indeed a probability mass function. 

(b) Compute the marginal distributions of X and Y. 

(c) Are X and Y independent? 

(d) Compute E(XY). 

Exercise 2.7.18. Compute E(X | Y = y) and E(Y | X = x) in the previous exercise. 

Exercise 2.7.19. Suppose that 15 percent of the families in a certain country have 
no children, 20 percent have 1, 35 percent have 2, and 30 percent have 3. Suppose 
further that each child is equally likely to be a boy or a girl, independent of the 
other children. A family is chosen at random, and we write B for the number of 
boys in this family, and G for the number of girls. Write down the joint probability 
mass function of (B, G). 

Exercise 2.7.20. We roll two fair dice. Find the joint probability mass function of 
X and Y when 

(a) X is the largest value obtained, and Y is the sum of the values; 

(b) X is the value on the first die, Y is the largest value; 

(c) X is the smallest value, and Y is the largest. 

Exercise 2.7.21. Compute E(Y | X = x) for all random variables in the previous exercise.

Exercise 2.7.22. Compute E(X | Y = y) and E(Y | X = x) in Example 2.4.10.

Exercise 2.7.23. Compute E(Y | X = x) for the lazy and the diligent student in Example 2.4.1.

Exercise 2.7.24. Suppose that we choose three different numbers randomly from 
{1,2,..., 10}. Let X be the largest of these three numbers. 

(a) Compute the distribution function of X. 

(b) Compute P(X = 9). 

Exercise 2.7.25. Suppose that a given experiment has k possible outcomes, the ith outcome having probability $p_i$. Denote the number of occurrences of the ith outcome in n independent experiments by $N_i$. Show that

\[ P(N_i = n_i,\ i = 1, \ldots, k) = \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}. \]



This is called the multinomial distribution. 



Exercise 2.7.26. Compute the marginal distributions of the multinomial distribution.



Exercise 2.7.27. Suppose that X and Y are independent, and both have the uniform distribution on $\{0, 1, \ldots, n\}$, that is, $P(X = i) = P(Y = i) = 1/(n+1)$, for all $i = 0, \ldots, n$. Show that

\[ P(X + Y = k) = \frac{n + 1 - |n - k|}{(n+1)^2}, \]

for all $k = 0, 1, \ldots, 2n$.

Exercise 2.7.28. Let X and Y be independent and geometrically distributed with 
the same parameter p. Compute the probability mass function of X - Y. Can you also compute P(X = Y) now?

Exercise 2.7.29. Let X and Y be as in the previous exercise. Compute $E(X \mid X + Y = k)$ for all $k = 2, 3, \ldots$.

Exercise 2.7.30. Let X be a random variable taking values in N. Assume that 
$P(X = k) \ge P(X = k + 1)$, for all k. Show that

\[ P(X = k) \le \frac{2E(X)}{k^2}. \]



Exercise 2.7.31. We throw a die repeatedly, until the total sum of the outcomes is 
at least 1400. Let $\beta$ be the probability that we need more than 420 throws of the 
die for this. Use Chebyshev's inequality to obtain a bound for $\beta$. 
Exercise 2.7.32. Let $X_1, X_2, \ldots$ be independent random variables with the same distribution, with finite variance and negative expectation $\mu$. Let $S_n = X_1 + \cdots + X_n$ be their sum. Use Chebyshev's inequality to show that for any constant c, we have

\[ \lim_{n \to \infty} P(S_n > c) = 0. \]

Exercise 2.7.33. Let X and Y be independent binomial random variables with 
parameters n and p. Denote their sum by Z. Show that

\[ P(X = k \mid Z = m) = \frac{\binom{n}{k}\binom{n}{m-k}}{\binom{2n}{m}}, \]

for $k = 1, \ldots, m$.

Exercise 2.7.34. Let $X_1, \ldots, X_n$ be independent random variables with $P(X_i = 1) = p_i$ and $P(X_i = 0) = 1 - p_i$, for all $i = 1, \ldots, n$. We write $Y = X_1 + \cdots + X_n$.

(a) Show that $E(Y) = \sum_{i=1}^{n} p_i$.

(b) Show that $\mathrm{var}(Y) = \sum_{i=1}^{n} p_i(1 - p_i)$.

(c) Show that, for E(Y) fixed, the variance var(Y) is maximal when all the $p_i$'s are the same: $p_1 = \cdots = p_n = E(Y)/n$.

(d) The result in (c) seems counterintuitive: the variance of the sum is greatest if 
the individuals are most alike. Try to explain this phenomenon. 

Exercise 2.7.35. (Secretary's problem) A secretary drops n matching pairs of letters and envelopes down the stairs, and then places the letters into the envelopes in a random order. Let X be the number of correctly matched pairs.

(a) Show that the probability that a given envelope contains the correct letter is 
equal to 1/n. 

(b) Use (a) to show that the expectation of X is equal to 1. 

(c) Show that the variance of X is also equal to 1. 

(d) Show that for all k,

\[ P(X = k) \to \frac{e^{-1}}{k!}, \]

as $n \to \infty$.

Exercise 2.7.36. Suppose that X and Y are random variables on the same sample space, taking values in $\mathbb{N}$. Suppose that $X(\omega) \le Y(\omega)$, for all $\omega \in \Omega$. Show that $E(X) \le E(Y)$.

Exercise 2.7.37. Suppose we throw a coin five times. Let X be the number of 
heads, and Y be the number of tails. We assume that the coin is biased, and that 
the probability of heads is equal to 1/3. We are interested in Z = X - Y.

(a) Express Z as a function of X.

(b) Compute E(Z) without computing the probability mass function of Z.

(c) Compute the probability mass function of Z, and use this to compute E(Z) once again.
Exercise 2.7.38. (Cantelli's inequality) Suppose that X has finite expectation $\mu$ and finite variance $\sigma^2$. Let $a > 0$. In parts (a) and (b), take $\mu = 0$; otherwise replace X by $X - \mu$.

(a) Show that $P(X \ge a) \le P((X + x)^2 \ge (a + x)^2)$. When is there equality?

(b) Show that

\[ P((X + x)^2 \ge (a + x)^2) \le \frac{E((X + x)^2)}{(a + x)^2} = \frac{\sigma^2 + x^2}{(a + x)^2}. \]

(c) Minimise (b) over x to obtain Cantelli's inequality:

\[ P(X - \mu \ge a) \le \frac{\sigma^2}{\sigma^2 + a^2}, \]

for all $a > 0$.

(d) Show that there can be equality in Cantelli’s inequality, by considering a 
random variable taking only two values. 

Exercise 2.7.39. Suppose that the number of times that a person gets sick during 
one year, has a Poisson distribution with expectation 5. A new medicine reduces 
this expectation to 3 for 75% of the population, and has no effect on the remaining 
25%. Suppose that John Smith uses the medicine, and gets sick twice. What is 
the conditional probability that the medicine has an effect on him? 

Exercise 2.7.40. A die is thrown five times. Use generating functions to compute 
the probability that the sum of the scores is 15. 

Exercise 2.7.41. Compute the generating function of the negative binomial distribution of Example 2.1.10. (Recall that this distribution is obtained as a sum of geometrical random variables, and use Example 2.6.4.) Use this to compute the expectation and the variance of a negative binomial distribution.




Chapter 3 

Random Walk 



In this chapter, we concentrate on an experiment that is strongly related to coin flips, the random walk. We shall prove a number of results that are very surprising and counterintuitive. It is quite nice that such results can be proved at this point already. The proofs are all based on (ingenious) counting methods.

3.1 Random Walk and Counting 

Consider a particle on the one-dimensional line, starting at position $a \in \mathbb{Z}$, say. This particle performs n random steps as follows. At each time $t = 1, 2, \ldots, n$, we flip a coin. If heads comes up, the particle moves one unit to the right; if tails comes up, the particle moves one unit to the left. The sample space corresponding to this experiment is simply $\Omega = \{-1, +1\}^n$, the set of sequences of length n with the symbols $-1$ and $+1$. For instance, if $n = 4$ and $\omega = (1, 1, -1, 1)$, then the first and second step are to the right, the third is to the left, and the fourth is to the right again. The reason that we define $\Omega$ as $\{-1, 1\}^n$ rather than $\{0, 1\}^n$ is that in the former case, we can conveniently represent the position of the particle after k steps by the random variable $S_k : \Omega \to \mathbb{R}$ given by

\[ S_k(\omega) = a + \sum_{i=1}^{k} \omega_i, \]

for $k = 1, 2, \ldots, n$. The probability measure on $\Omega$ associated to the random walk gives equal probability $2^{-n}$ to all $2^n$ possible outcomes. Hence, if we want to compute the probability of an event A, we need to count the number of elements in A and multiply by $2^{-n}$. We should denote this probability measure by $P_n$, but we will drop the subscript n when no confusion is possible. We will mostly be interested in the joint distribution of the $S_k$'s, that is, in the random vector

\[ S(\omega) = (S_1(\omega), \ldots, S_n(\omega)). \]
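The setup above can be sketched in a few lines of Python. This is only an illustration for small n: the function names `sample_space`, `positions` and `probability` are made up for this sketch and do not appear in the text.

```python
from itertools import product

def sample_space(n):
    """All 2**n step sequences in {-1, +1}^n (the set Omega)."""
    return list(product((-1, 1), repeat=n))

def positions(omega, a=0):
    """Return (S_1, ..., S_n) for the step sequence omega and starting point a."""
    path, s = [], a
    for step in omega:
        s += step
        path.append(s)
    return tuple(path)

def probability(event, n, a=0):
    """P(A) = |A| * 2**(-n), where the event A is a predicate on (S_1, ..., S_n)."""
    count = sum(1 for omega in sample_space(n) if event(positions(omega, a)))
    return count / 2 ** n

# The example from the text: n = 4 and omega = (1, 1, -1, 1).
print(positions((1, 1, -1, 1)))
# Probability that a walk of 4 steps started in 0 ends in 0.
print(probability(lambda S: S[-1] == 0, 4))
```

Enumerating all of $\Omega$ is only feasible for small n, but it mirrors the counting arguments used in the rest of this chapter.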
Figure 3.1: The graph of a random walk with $S_0 = 0$, $S_1 = 1$, $S_2 = 0$, etcetera.



A convenient way to think about this random vector, is to construct the graph 
obtained by joining the points $\{(k, S_k),\ k = 1, \ldots, n\}$, see Figure 3.1. We shall 
refer to such a graph as a path. Many important events can be described in terms 
of such paths. 

The first interesting question deals with the marginal distribution of $S_n$.

Theorem 3.1.1. For all $-n + a \le b \le n + a$, we have

\[ P(S_n = b) = \binom{n}{\frac{1}{2}(n + b - a)} 2^{-n}. \]

Proof. We need to count the number of outcomes $\omega \in \Omega$ for which $S_n(\omega) = b$. For the sum of n $\pm 1$s to be equal to $b - a$, it is necessary and sufficient to have $\frac{1}{2}(n + b - a)$ 1s, and $\frac{1}{2}(n - b + a)$ $-1$s. (To see how to obtain these numbers, denote the number of 1s by $n_+$ and the number of $-1$s by $n_-$. Then $n_+ + n_- = n$ and $n_+ - n_- = b - a$. Solving these equations leads to the answer given above.) There are $\binom{n}{\frac{1}{2}(n+b-a)}$ outcomes with this property, and the result follows immediately. □
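As a sanity check on Theorem 3.1.1, the following sketch compares brute-force counting with the binomial formula. The helper names are hypothetical, and the convention that the probability is 0 when $n + b - a$ is odd is an assumption made explicit in the code.

```python
from itertools import product
from math import comb

def count_endpoints(n, a, b):
    """Number of step sequences in {-1,+1}^n with a + (sum of steps) == b."""
    return sum(1 for omega in product((-1, 1), repeat=n) if a + sum(omega) == b)

def marginal(n, a, b):
    """P(S_n = b) as in Theorem 3.1.1; taken to be 0 if n + b - a is odd or out of range."""
    twice_ups = n + b - a          # equals 2 * (number of +1 steps)
    if twice_ups % 2 or not 0 <= twice_ups <= 2 * n:
        return 0.0
    return comb(n, twice_ups // 2) / 2 ** n

n, a = 6, 2
for b in range(a - n, a + n + 1):
    # Both sides are exact dyadic fractions, so float comparison is safe here.
    assert count_endpoints(n, a, b) / 2 ** n == marginal(n, a, b)
print("Theorem 3.1.1 checked for n = 6, a = 2")
```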

Exercise 3.1.2. Give the marginal distribution of $S_k$ for $k < n$.

Exercise 3.1.3. Why is $P(S_n = b) = 0$ when $|b - a| > n$?

The following definition turns out to be very convenient for counting purposes.

Definition 3.1.4. We denote by $N_n(a, b)$ the number of paths that start in $(0, a)$ and end in $(n, b)$, and by $N_n^0(a, b)$ the number of such paths which contain at least one point on the x-axis.

Figure 3.2: The reflection principle.



Exercise 3.1.5. Show, using Theorem 3.1.1, that

\[ N_n(a, b) = \binom{n}{\frac{1}{2}(n + b - a)}. \]

Lemma 3.1.6. (Reflection principle) For $a, b > 0$ we have

\[ N_n^0(a, b) = N_n(-a, b). \]

Proof. Consider a path from $(0, -a)$ to $(n, b)$. This path must intersect the x-axis for the first time at some time k, say. Now reflect the segment of the path between $x = 0$ and $x = k$ in the x-axis. The reflection is a path from $(0, a)$ to $(n, b)$ which intersects the x-axis, see Figure 3.2. This reflection operator gives in fact a one-one correspondence between the appropriate collections of paths, and we are done. □
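The reflection principle can be checked by brute force for small n. The sketch below is an illustration only; `paths`, `N`, `N0` and `reflect` are names chosen here, not notation from the text beyond $N_n$ and $N_n^0$.

```python
from itertools import product

def paths(n, start):
    """Yield every path (S_0, ..., S_n) of n steps starting at height start."""
    for omega in product((-1, 1), repeat=n):
        path = [start]
        for step in omega:
            path.append(path[-1] + step)
        yield tuple(path)

def N(n, a, b):
    """N_n(a, b): number of paths from (0, a) to (n, b)."""
    return sum(1 for p in paths(n, a) if p[-1] == b)

def N0(n, a, b):
    """N_n^0(a, b): such paths with at least one point on the x-axis."""
    return sum(1 for p in paths(n, a) if p[-1] == b and 0 in p)

def reflect(path):
    """Reflect the segment before the first visit to 0 in the x-axis."""
    k = path.index(0)
    return tuple(-s for s in path[:k]) + path[k:]

n, a, b = 7, 2, 3
from_minus_a = [p for p in paths(n, -a) if p[-1] == b]
touching = {p for p in paths(n, a) if p[-1] == b and 0 in p}
# The reflection maps paths from (0, -a) to (n, b) bijectively onto `touching`.
assert {reflect(p) for p in from_minus_a} == touching
assert len(from_minus_a) == len(touching) == N0(n, a, b)
```

The set comparison checks more than the equality of the two counts: it confirms that the reflected paths are exactly the paths from $(0, a)$ to $(n, b)$ that touch the x-axis.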



Exercise 3.1.7. Show that the reflection operator indeed gives a one-one correspondence.

Theorem 3.1.8. (Ballot theorem) Let $b > 0$. The number of paths from $(0, 0)$ to $(n, b)$ which do not visit the x-axis (apart from the starting point) is equal to

\[ \frac{b}{n} N_n(0, b). \]

Proof. Note that the first step of such a path is necessarily to $(1, 1)$. After this first step, we need to make $n - 1$ additional steps, from $(1, 1)$ to $(n, b)$, in such a way that the path does not visit the x-axis. This can be done in $N_{n-1}(1, b) - N^0_{n-1}(1, b)$ ways. According to the reflection principle, this is equal to $N_{n-1}(1, b) - N_{n-1}(-1, b)$, which can be computed as follows:

\begin{align*}
N_{n-1}(1, b) - N_{n-1}(-1, b)
 &= \binom{n-1}{\frac{1}{2}(n - 1 + b - 1)} - \binom{n-1}{\frac{1}{2}(n - 1 + b + 1)} \\
 &= \frac{(n-1)!}{\left(\frac{1}{2}(n + b - 2)\right)!\,\left(\frac{1}{2}(n - b - 2)\right)!}
    \left\{ \frac{1}{\frac{1}{2}(n - b)} - \frac{1}{\frac{1}{2}(n + b)} \right\} \\
 &= \frac{n!}{\left(\frac{1}{2}(n - b)\right)!\,\left(\frac{1}{2}(n + b)\right)!} \cdot \frac{b}{n} \\
 &= \binom{n}{\frac{1}{2}(n + b)} \frac{b}{n} = \frac{b}{n} N_n(0, b).
\end{align*}
□

Exercise 3.1.9. Interpret the special case $b = n$.

Exercise 3.1.10. Do you understand why this is called the ballot theorem? Think of the following question: suppose that in a ballot, candidate A scores a votes, and candidate B scores b votes, where $a > b$. What is the probability that during the counting of the votes, A was always ahead of B?

Theorem 3.1.11. Suppose that the random walk starts in 0, and let $b \neq 0$. Then we have

\[ P(S_n = b,\ S_k \neq 0 \text{ for all } 1 \le k \le n) = \frac{|b|}{n} P(S_n = b). \]

Proof. Take $b > 0$ (the case $b < 0$ is the subject of the next exercise). The event at the left-hand side occurs precisely when the path of the random walk does not visit the x-axis in the time interval from 1 to n, and $S_n = b$. The number of such paths is, according to the ballot theorem, equal to $\frac{b}{n} N_n(0, b)$. Now $N_n(0, b)$ is the number of paths that have $S_n = b$. Since all paths have the same probability, the result follows. □

Exercise 3.1.12. Give the proof of the last theorem for the case $b < 0$.

Exercise 3.1.13. Show that

\[ P(S_1 S_2 \cdots S_n \neq 0) = \frac{1}{n} E(|S_n|). \]
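The ballot theorem (Theorem 3.1.8) lends itself to a direct numerical check. The sketch below counts the paths that stay off the x-axis by enumeration and compares the result with $\frac{b}{n} N_n(0, b)$; the function names are hypothetical.

```python
from itertools import product
from math import comb

def positive_paths(n, b):
    """Count paths from (0,0) to (n,b) that never return to the x-axis."""
    count = 0
    for omega in product((-1, 1), repeat=n):
        s, ok = 0, True
        for step in omega:
            s += step
            if s == 0:          # the path visits the x-axis again
                ok = False
                break
        if ok and s == b:
            count += 1
    return count

def ballot_count(n, b):
    """(b/n) * N_n(0, b), using N_n(0, b) = C(n, (n + b)/2)."""
    if (n + b) % 2 or b > n:
        return 0
    return b * comb(n, (n + b) // 2) // n

for n in range(1, 11):
    for b in range(1, n + 1):
        assert positive_paths(n, b) == ballot_count(n, b)
print("ballot theorem checked for n up to 10")
```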

We end this section with a consequence of Theorem 3.1.11 which is interesting 
in its own right, but also very useful in the next section. 

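Lemma 3.1.14. For the random walk we have, for all $m \ge 0$,

\[ P(S_1 S_2 \cdots S_{2m} \neq 0) = P(S_{2m} = 0). \]

Proof. For $m = 0$ both sides are equal to 1, so let $m \ge 1$. Using Theorem 3.1.11 and symmetry, we find

\begin{align*}
P(S_1 S_2 \cdots S_{2m} \neq 0)
 &= \sum_{b=-2m}^{2m} P(S_1 S_2 \cdots S_{2m} \neq 0,\ S_{2m} = b) \\
 &= 2 \sum_{k=1}^{m} \frac{2k}{2m} P(S_{2m} = 2k) \\
 &= 2 \cdot 2^{-2m} \sum_{k=1}^{m} \frac{2k}{2m} \binom{2m}{m+k} \\
 &= 2 \cdot 2^{-2m} \sum_{k=1}^{m} \left\{ \binom{2m-1}{m+k-1} - \binom{2m-1}{m+k} \right\},
\end{align*}

where we interpret $\binom{2m-1}{2m}$ as being equal to 0. The last expression contains a so-called telescoping sum: when you write down all terms in the sum, you see that most terms cancel. The only terms that do not cancel are $\binom{2m-1}{m}$, coming from $k = 1$, and $\binom{2m-1}{2m}$, coming from $k = m$. Since the latter is defined to be 0, we find that

\[ P(S_1 S_2 \cdots S_{2m} \neq 0) = 2 \cdot 2^{-2m} \binom{2m-1}{m} = P(S_{2m} = 0). \]
□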
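Lemma 3.1.14 is easy to verify numerically for small m. The following sketch (function names are illustrative) enumerates all paths of length 2m and compares the probability of no return to 0 with $P(S_{2m} = 0) = \binom{2m}{m} 2^{-2m}$, using exact fractions.

```python
from itertools import product
from math import comb
from fractions import Fraction

def prob_no_return(m):
    """P(S_1 S_2 ... S_2m != 0), by enumerating all 2**(2m) paths started in 0."""
    n, good = 2 * m, 0
    for omega in product((-1, 1), repeat=n):
        s = 0
        for step in omega:
            s += step
            if s == 0:
                break
        else:                    # loop finished without a visit to 0
            good += 1
    return Fraction(good, 2 ** n)

def prob_at_zero(m):
    """P(S_2m = 0) = C(2m, m) * 2**(-2m)."""
    return Fraction(comb(2 * m, m), 4 ** m)

for m in range(0, 7):
    assert prob_no_return(m) == prob_at_zero(m)
print([str(prob_at_zero(m)) for m in range(5)])
```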



3.2 The Arc-Sine Law 

In this section, it is convenient to assume that the total number of steps of the random walk is even, that is, we assume that the total number of steps is equal to 2n for some $n \in \mathbb{N}$.

One might be inclined to think that if the particle makes 2n steps, starting 
at 0, the (random) last time at which the particle visits the x-axis tends to be 
closer to the end than to the beginning of the path. However, the following very 
surprising result shows that the distribution of this last visit to the x-axis, is in 
fact symmetric around n, the midpoint of the time interval. So, for instance, the 
probability that the last visit occurs at time 2 is the same as the probability that 
the last visit occurs at time 2n - 2. I think that this is very counterintuitive. 

The (random) last visit to 0, up to time 2n, is denoted by $L_{2n}$.

Theorem 3.2.1. For all $0 \le k \le n$ we have

\[ P(L_{2n} = 2k) = P(S_{2k} = 0)\, P(S_{2n-2k} = 0). \]

In particular, the distribution of $L_{2n}$ is symmetric around n.
Proof. We have that $L_{2n} = 2k$ if and only if $S_{2k} = 0$ and there are no further visits to zero after that point, that is, if $S_{2k+1} S_{2k+2} \cdots S_{2n} \neq 0$. Hence,

\begin{align*}
P(L_{2n} = 2k)
 &= P(S_{2k} = 0,\ S_{2k+1} S_{2k+2} \cdots S_{2n} \neq 0) \\
 &= P\Big(S_{2k} = 0,\ \sum_{i=2k+1}^{r} \omega_i \neq 0 \text{ for all } r = 2k+1, \ldots, 2n\Big) \\
 &= P(S_{2k} = 0)\, P\Big(\sum_{i=2k+1}^{r} \omega_i \neq 0 \text{ for all } r = 2k+1, \ldots, 2n\Big) \\
 &= P(S_{2k} = 0)\, P\Big(\sum_{i=1}^{r} \omega_i \neq 0 \text{ for all } r = 1, \ldots, 2n - 2k\Big) \\
 &= P(S_{2k} = 0)\, P(S_1 S_2 \cdots S_{2n-2k} \neq 0) \\
 &= P(S_{2k} = 0)\, P(S_{2n-2k} = 0),
\end{align*}

where the last equality follows from Lemma 3.1.14. □
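To see Theorem 3.2.1 at work, one can compute the exact distribution of the last visit to 0 by enumeration and compare it with the product formula. This is a small sketch under the assumption that the walk starts in 0; `last_visit_distribution` and `u` are names introduced here.

```python
from itertools import product
from math import comb
from fractions import Fraction
from collections import Counter

def last_visit_distribution(n):
    """Exact law of L_2n, by enumeration over all paths of length 2n started in 0."""
    counts = Counter()
    for omega in product((-1, 1), repeat=2 * n):
        s, last = 0, 0
        for t, step in enumerate(omega, start=1):
            s += step
            if s == 0:
                last = t         # remember the most recent visit to 0
        counts[last] += 1
    return {t: Fraction(c, 4 ** n) for t, c in counts.items()}

def u(k):
    """P(S_2k = 0) = C(2k, k) * 2**(-2k)."""
    return Fraction(comb(2 * k, k), 4 ** k)

n = 5
law = last_visit_distribution(n)
for k in range(n + 1):
    assert law[2 * k] == u(k) * u(n - k)       # Theorem 3.2.1
    assert law[2 * k] == law[2 * (n - k)]      # symmetry around n
print({t: str(p) for t, p in sorted(law.items())})
```

The printed values show the symmetry directly: the probabilities are largest at the two ends and smallest in the middle.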

It is perhaps not so clear why this result is related to (or sometimes even 
called) the arc-sine law. I will explain this now. For this, we need the following 
classical asymptotic result, which I give without proof. A proof can be found in 
Feller (1978). For two sequences of real numbers $a_n$ and $b_n$, $n = 1, 2, \ldots$, both converging to infinity as $n \to \infty$, we say that $a_n \sim b_n$ if $a_n / b_n \to 1$ as $n \to \infty$.

Lemma 3.2.2. (Stirling's formula) We have

\[ n! \sim n^n e^{-n} (2\pi n)^{1/2} \]

as $n \to \infty$. Moreover, it is the case that for all n,

\[ n^n e^{-n} (2\pi n)^{1/2} \le n! \le 2\, n^n e^{-n} (2\pi n)^{1/2}. \]
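Both statements of Lemma 3.2.2 can be illustrated numerically. The sketch below prints the ratio of n! to Stirling's approximation for a few values of n and checks the two-sided bound for those values; it is an illustration, not a proof, and the choice of test values is arbitrary.

```python
from math import factorial, exp, pi, sqrt

def stirling(n):
    """Stirling's approximation n^n e^(-n) (2 pi n)^(1/2)."""
    return n ** n * exp(-n) * sqrt(2 * pi * n)

for n in (1, 2, 5, 10, 50, 100):
    ratio = factorial(n) / stirling(n)
    # Second statement of the lemma: the ratio lies between 1 and 2.
    assert 1 <= ratio <= 2
    print(n, ratio)
```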

The next theorem explains the title of this section. It says something about 
the behaviour of the random variable L2n when n is large. Clearly, when n tends 
to infinity, one expects that the moment of the last visit to 0 also typically gets 
bigger and bigger. Therefore, in order to obtain an interesting result, we have to 
normalise $L_{2n}$. In this case, it turns out that we have to divide by 2n.

Theorem 3.2.3. (Arc-sine law) For $x \in (0, 1)$, we have

\[ \lim_{n \to \infty} P_{2n}\left( \frac{L_{2n}}{2n} \le x \right) = \frac{2}{\pi} \arcsin x^{1/2}. \]



This type of convergence will later in this book be called convergence in 
distribution. It is a notion of convergence in terms of distribution functions. In the 
proof of this theorem, we will need a technical result about Riemann integrals, 
which may not be so well known. The theorem provides a condition under which 
we can interchange limit and integral, and we give it without proof. 
Theorem 3.2.4. (Dominated convergence) Let $a < b$ and let $f, f_n$, $n = 1, 2, \ldots$, be absolutely Riemann integrable functions with $f_n(x) \to f(x)$ as $n \to \infty$, for all $x \in (a, b)$. Suppose in addition that there exists a non-negative function g with $\int_a^b g(x)\,dx < \infty$ and such that $|f_n(x)| \le g(x)$, for all $x \in (a, b)$ and all n. Then

\[ \lim_{n \to \infty} \int_a^b f_n(x)\,dx = \int_a^b f(x)\,dx. \]




Proof of Theorem 3.2.3. Let $k \le n$. Since

\[ P_{2n}(S_{2k} = 0) = \binom{2k}{k} 2^{-2k} = \frac{(2k)!}{k!\,k!}\, 2^{-2k}, \]

we can apply Stirling's formula to obtain that

\[ P_{2n}(S_{2k} = 0) \sim \frac{1}{(\pi k)^{1/2}}, \]

as $k \to \infty$ (and n as well, of course).

Exercise 3.2.5. Prove this last statement.

It follows from this and Theorem 3.2.1 that

\[ P_{2n}(L_{2n} = 2k) \sim \frac{1}{\pi (k(n-k))^{1/2}}, \]

when $k \to \infty$ and $n - k \to \infty$. Choose $x \in (0, 1)$ and choose a sequence of k, n such that $k/n \to x$. This implies that

\[ \frac{n}{\pi (k(n-k))^{1/2}} \to \frac{1}{\pi (x(1-x))^{1/2}}; \]

simply divide both the numerator and denominator on the left by n. Combining the last two displayed formulas now leads to

\[ n P_{2n}(L_{2n} = 2k) \to \frac{1}{\pi (x(1-x))^{1/2}} \tag{3.1} \]

as k, n tend to infinity so that $k/n \to x$. Now consider the normalised random variable $L_{2n}/2n$, and define a function $f_n$ on $(0, 1)$ in two steps as follows. First we define

\[ f_n\left(\frac{k}{n}\right) = n P_{2n}(L_{2n} = 2k), \]

and then we define $f_n$ between these points by connecting consecutive points with straight line segments. It follows from (3.1) that

\[ f_n(x) \to \frac{1}{\pi (x(1-x))^{1/2}} =: f(x), \]

for all $x \in (0, 1)$. According to the second statement in Lemma 3.2.2, on any interval $0 < u \le x \le v < 1$ there is a constant C so that

\[ f_n(x) \le \frac{C}{(x(1-x))^{1/2}}. \]

The right-hand side is an integrable function and it then follows from Theorem 3.2.4 that

\[ \int_u^v f_n(x)\,dx \to \int_u^v f(x)\,dx. \tag{3.2} \]

Now consider the distribution of $L_{2n}/2n$:

\[ P_{2n}\left(u \le \frac{L_{2n}}{2n} \le v\right) = \sum_{k:\, u \le k/n \le v} P_{2n}(L_{2n} = 2k) = \frac{1}{n} \sum_{k:\, u \le k/n \le v} f_n\left(\frac{k}{n}\right). \]

But

\[ \left| \frac{1}{n} \sum_{k:\, u \le k/n \le v} f_n\left(\frac{k}{n}\right) - \int_u^v f_n(x)\,dx \right| \to 0, \]

as $n \to \infty$. Hence

\[ \lim_{n \to \infty} P_{2n}\left(u \le \frac{L_{2n}}{2n} \le v\right) = \lim_{n \to \infty} \int_u^v f_n(x)\,dx = \int_u^v f(x)\,dx. \]

Integrating now leads to the desired result. □
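The convergence in the arc-sine law can be observed numerically. The sketch below computes $P(L_{2n}/2n \le x)$ exactly from Theorem 3.2.1 and prints it next to $\frac{2}{\pi} \arcsin \sqrt{x}$; the function names and the chosen values of n and x are assumptions of this illustration.

```python
from math import comb, asin, sqrt, pi

def u(k):
    """P(S_2k = 0) = C(2k, k) / 4**k, as a float."""
    return comb(2 * k, k) / 4 ** k

def last_visit_cdf(n, x):
    """P(L_2n / 2n <= x), with point probabilities taken from Theorem 3.2.1."""
    return sum(u(k) * u(n - k) for k in range(n + 1) if k <= x * n)

def arcsine_cdf(x):
    """Limiting distribution function (2 / pi) * arcsin(sqrt(x))."""
    return 2 / pi * asin(sqrt(x))

for n in (10, 100, 1000):
    print(n, [round(last_visit_cdf(n, x), 4) for x in (0.1, 0.25, 0.5, 0.9)])
print("limit", [round(arcsine_cdf(x), 4) for x in (0.1, 0.25, 0.5, 0.9)])
```

Note how much mass sits near 0 and near 1: the limiting density $1/(\pi (x(1-x))^{1/2})$ is smallest at the midpoint $x = 1/2$.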



3.3 Exercises 

Exercise 3.3.1. Consider a random walk $S_i$, which starts in 0, and makes 2n steps. Let $k < n$. Show that

(a) $E(|S_{2k}| \mid |S_{2k-1}| = r) = r$;

(b) $E(|S_{2k+1}| \mid |S_{2k}| = r) = 1$ if $r = 0$, and $E(|S_{2k+1}| \mid |S_{2k}| = r) = r$ otherwise.

Exercise 3.3.2. Consider a random walk making 2n steps, and let T be the first return to its starting point, that is

\[ T = \min\{1 \le k \le 2n : S_k = 0\}, \]

and $T = 0$ if the walk does not return to zero in the first 2n steps. Show that for all $1 \le k \le n$ we have.

Exercise 3.3.3. Show that it already follows from Lemma 3.1.14 that

\[ P(L_{2n} = 0) = P(L_{2n} = 2n). \]
We have already encountered a number of limit theorems. For instance, the law of large numbers (Theorem 1.6.1) and the arc-sine law (Theorem 3.2.3) were statements about the limiting behaviour of a growing number of random variables. In this chapter, we state new laws of large numbers, and a primitive version of the famous central limit theorem.

The general setup in this section is as follows. We have random variables $X_1, \ldots, X_n$ defined on some sample space, with probability measure $P_n$. We typically do not express the dependence on n of this probability measure, and write P instead of $P_n$.



4.1 The Law of Large Numbers 



The law of large numbers in Theorem 1.6.1 was for very special random variables, 
namely those which only take the values 0 and 1. Note that when X takes the 
values 0 and 1 with equal probability (as in Theorem 1.6.1), the expectation of X 
is 1/2. Hence Theorem 1.6.1 says something about convergence of averages to an 
expectation. In the meantime, that is, since Theorem 1.6.1, we have defined and 
discussed the expectation of general discrete random variables, and therefore it 
makes sense to try to state (and prove) a more general law of large numbers which 
should again say something about convergence of averages to an expectation. It 
turns out that when we assume that the variance is finite, this is in fact quite easy, 
using Chebyshev's inequality, Corollary 2.3.23. 



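Theorem 4.1.1. (Weak law of large numbers) Let $X_1, X_2, \ldots, X_n$ be independent random variables with the same distribution. Let $E(X_1) = \mu$ and $\mathrm{var}(X_1) = \sigma^2$ both be finite. Denote the sum $X_1 + \cdots + X_n$ by $S_n$. Then, for any $\epsilon > 0$ we have

\[ P\left( \left| \frac{S_n}{n} - \mu \right| > \epsilon \right) \le \frac{\sigma^2}{n \epsilon^2}, \]

which tends to zero when n tends to infinity.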
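Proof. We first compute the expectation and variance of $S_n/n$. The expectation of $S_n/n$ is equal to $\mu$, and to compute the variance, we write

\[ \mathrm{var}\left(\frac{S_n}{n}\right) = \mathrm{var}\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{var}(X_i) = \frac{\sigma^2}{n}. \]

It then follows from Chebyshev's inequality, Corollary 2.3.23, that

\[ P\left( \left| \frac{S_n}{n} - \mu \right| > \epsilon \right) \le \frac{\mathrm{var}(S_n/n)}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \to 0, \]

as n tends to infinity. □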
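As an illustration of Theorem 4.1.1, the following sketch takes the $X_i$ to be throws of a fair die (an example chosen here, not taken from the text), computes $P(|S_n/n - \mu| > \epsilon)$ exactly by repeated convolution, and compares it with the Chebyshev bound $\sigma^2/(n\epsilon^2)$. All function names are hypothetical.

```python
from fractions import Fraction

def sum_distribution(pmf, n):
    """Exact pmf of S_n = X_1 + ... + X_n for i.i.d. X_i with the given pmf."""
    dist = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for s, ps in dist.items():
            for x, px in pmf.items():
                new[s + x] = new.get(s + x, 0) + ps * px
        dist = new
    return dist

def deviation_probability(pmf, n, eps):
    """P(|S_n / n - mu| > eps), computed exactly."""
    mu = sum(x * p for x, p in pmf.items())
    return sum(p for s, p in sum_distribution(pmf, n).items()
               if abs(Fraction(s, n) - mu) > eps)

def chebyshev_bound(pmf, n, eps):
    """The bound sigma^2 / (n eps^2) from Theorem 4.1.1."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    return var / (n * eps ** 2)

die = {k: Fraction(1, 6) for k in range(1, 7)}
eps = Fraction(1, 2)
for n in (10, 20, 40):
    exact = deviation_probability(die, n, eps)
    bound = chebyshev_bound(die, n, eps)
    print(n, float(exact), float(bound))
```

The exact probabilities are typically much smaller than the bound: Chebyshev's inequality is crude, but it is enough to prove convergence to zero.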

Exercise 4.1.2. Show that this law of large numbers generalises Theorem 1.6.1.

Exercise 4.1.3. Relax the condition that all $X_i$'s should have the same distribution. Can you also relax the assumption of independence? (Look at the proof above for what you really need to make it work.)

The weak law of large numbers as just stated is useful in many applications; 
we shall see examples of this later. However, one of the assumptions is that the 
variance of the random variables in question is finite. This condition is sometimes 
of course not met, and since the amount of extra work needed to get rid of this 
assumption is relatively modest, here follows a more general law of large numbers. 
An extra argument to include this one is that its proof is based on a very important 
technique in probability theory, namely the technique of truncation. 

Before we start, let us recall one more piece of notation. When X is a random variable, then we can define new random variables like $1_{\{X > a\}}$, which takes the value 1 if $X > a$ and the value 0 otherwise, that is,

\[ 1_{\{X > a\}}(\omega) = \begin{cases} 1 & \text{if } X(\omega) > a, \\ 0 & \text{if } X(\omega) \le a. \end{cases} \]

The subscript can of course be any restriction on X. Note that the expectation of such random variables is easily computed, since they take only two values. For instance, in the above case we have

\[ E(1_{\{X > a\}}) = P(X > a). \]



Theorem 4.1.4. (General weak law of large numbers) Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables, with $E(|X_1|) < \infty$, and write $E(X_1) = \mu$. Then for any $\epsilon > 0$ we have

\[ P\left( \left| \frac{S_n}{n} - \mu \right| > \epsilon \right) \to 0, \]

as $n \to \infty$.
Before we prove this important result, we state a lemma which generalises Exercise 2.3.7.

Lemma 4.1.5. Let X be a random variable taking only non-negative values, and let $p > 0$. Then

\[ E(X^p) = \int_0^\infty p x^{p-1} P(X > x)\,dx. \]

Proof. We start with the right-hand side and work our way to the left-hand side as follows.

\begin{align*}
\int_0^\infty p x^{p-1} P(X > x)\,dx
 &= \int_0^\infty p x^{p-1} \sum_k P(X = k)\, 1_{\{k > x\}}(x)\,dx \\
 &= \sum_k P(X = k) \int_0^\infty p x^{p-1}\, 1_{\{x < k\}}(x)\,dx \\
 &= \sum_k P(X = k) \int_0^k p x^{p-1}\,dx \\
 &= \sum_k P(X = k)\, k^p = E(X^p).
\end{align*}
□

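Proof of Theorem 4.1.4. First, we observe that for $x > 0$ we have

\begin{align*}
x P(|X_1| > x) = x E(1_{\{|X_1| > x\}}) &\le E(|X_1| 1_{\{|X_1| > x\}}) \\
 &= \sum_{k > x} k P(|X_1| = k) \to 0,
\end{align*}

when $x \to \infty$, since $E(|X_1|) = \sum_k k P(|X_1| = k) < \infty$ by assumption. Next, we define

\[ \mu_n = E(X_1 1_{\{|X_1| \le n\}}) = \sum_{|k| \le n} k P(X_1 = k). \]

Since $E(|X_1|) < \infty$, we have that $\mu_n \to E(X_1) = \mu$, as $n \to \infty$. This means that it suffices to show that (writing $S_n$ for the sum of the appropriate random variables as usual)

\[ P\left( \left| \frac{S_n}{n} - \mu_n \right| > \epsilon \right) \to 0, \tag{4.1} \]

as $n \to \infty$. In order to do that we use truncation. We define new random variables $X_{n,k}$ as follows:

\[ X_{n,k} = X_k 1_{\{|X_k| \le n\}}. \]

Finally, we set

\[ S'_n = X_{n,1} + \cdots + X_{n,n}. \]

We may now write

\begin{align*}
P\left( \left| \frac{S_n}{n} - \mu_n \right| > \epsilon \right)
 &= P\left( \left| \frac{S_n}{n} - \mu_n \right| > \epsilon,\ S_n = S'_n \right)
  + P\left( \left| \frac{S_n}{n} - \mu_n \right| > \epsilon,\ S_n \neq S'_n \right) \\
 &\le P\left( \left| \frac{S'_n}{n} - \mu_n \right| > \epsilon \right) + P(S_n \neq S'_n).
\end{align*}

The second term at the right-hand side is easily estimated, since

\begin{align*}
P(S_n \neq S'_n) &\le P(X_k \neq X_{n,k} \text{ for some } k \le n) \\
 &\le n P(|X_1| > n),
\end{align*}

which tends to zero, as observed at the beginning of this proof.

The first term on the right-hand side is more complicated. From Markov's inequality, Corollary 2.3.22, it follows that

\begin{align*}
P\left( \left| \frac{S'_n}{n} - \mu_n \right| > \epsilon \right)
 &\le \frac{1}{\epsilon^2} E\left( \left| \frac{S'_n}{n} - \mu_n \right|^2 \right) \\
 &= \frac{1}{n^2 \epsilon^2} E\left( \Big( \sum_{i=1}^{n} X_{n,i} - n\mu_n \Big)^2 \right) \\
 &= \frac{1}{n^2 \epsilon^2} \mathrm{var}\Big( \sum_{i=1}^{n} X_{n,i} \Big) \\
 &= \frac{n\, \mathrm{var}(X_{n,1})}{n^2 \epsilon^2} \le \frac{1}{n \epsilon^2} E(X_{n,1}^2).
\end{align*}

According to Lemma 4.1.5 we have

\begin{align*}
\frac{1}{n} E(X_{n,1}^2) &= \frac{1}{n} \int_0^\infty 2x P(|X_{n,1}| > x)\,dx \\
 &\le \frac{1}{n} \int_0^n 2x P(|X_1| > x)\,dx.
\end{align*}

According to the first claim in this proof again, the integrand of the last integral converges to 0 as $x \to \infty$. Hence, for any small $\delta > 0$, we can find $x_0$ so that $2x P(|X_1| > x) < \delta$ for all $x > x_0$. Writing $M < \infty$ for the supremum over all x of the integrand, we can then bound the right-hand side by

\[ \frac{M x_0}{n} + \delta, \]

which is at most $2\delta$ for n large enough. Since $\delta$ was arbitrary, the proof is complete. □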
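The three quantities that drive the truncation argument can be made concrete with an example. The sketch below uses the distribution $P(X = k) = c/k^3$, $k = 1, 2, \ldots$, which has finite expectation but infinite variance; this distribution is chosen here for illustration and is not taken from the text. The infinite support is cut off at a large value K, which is a numerical assumption, so all printed numbers are approximations.

```python
from itertools import accumulate

K = 10 ** 5                                      # numerical cutoff (assumption)
weights = [1 / k ** 3 for k in range(1, K + 1)]
c = 1 / sum(weights)                             # normalising constant
pmf = [c * w for w in weights]                   # pmf[k - 1] approximates P(X = k)

cum_prob = list(accumulate(pmf))                 # cum_prob[k - 1] = P(X <= k)
cum_mean = list(accumulate(k * q for k, q in enumerate(pmf, start=1)))

def tail_times_x(x):
    """x * P(X > x) for an integer 1 <= x <= K; tends to 0 as in the first claim."""
    return x * (1 - cum_prob[x - 1])

def truncated_mean(n):
    """mu_n = E(X 1{X <= n}); tends to mu = E(X)."""
    return cum_mean[n - 1]

def truncated_second_moment_over_n(n):
    """E(X_{n,1}^2) / n; tends to 0 even though E(X^2) is infinite."""
    return sum(k * k * pmf[k - 1] for k in range(1, n + 1)) / n

for n in (10, 100, 1000):
    print(n, tail_times_x(n), truncated_mean(n), truncated_second_moment_over_n(n))
```

The last column is the point of truncation: the second moment of the truncated variable grows with n, but more slowly than n itself.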
Example 4.1.6. When you play roulette in a casino, the probability of losing is always slightly larger than the probability of winning. Suppose for simplicity that you win one euro with probability 49/100 and lose one with probability 51/100. We can model this by introducing independent random variables $X_1, X_2, \ldots$ with $P(X_i = 1) = 49/100$ and $P(X_i = -1) = 51/100$. A short computation shows that $E(X_i) = -1/50$. Your capital after n games is $S_n = X_1 + \cdots + X_n$. The law of large numbers now tells you that

\[ P\left( \left| \frac{S_n}{n} + \frac{1}{50} \right| > \epsilon \right) \to 0, \]

as $n \to \infty$. Hence $S_n/n$ will, with high probability, be close to $-1/50$, and therefore $S_n$ will be close to $-n/50$. You are bound to lose. Also, the probability that after n games you have won can be estimated as

\[ P(S_n > 0) = P\left( \frac{S_n}{n} + \frac{1}{50} > \frac{1}{50} \right) \le P\left( \left| \frac{S_n}{n} + \frac{1}{50} \right| > \frac{1}{50} \right), \]

which tends to 0 as $n \to \infty$, according to the law of large numbers. □
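For the roulette example, the probability of being ahead after n games can be computed exactly, since $S_n > 0$ precisely when more than half of the games are won. The sketch below does this with exact integer arithmetic and also evaluates the Chebyshev bound from Theorem 4.1.1 with $\epsilon = 1/50$; the function names are hypothetical.

```python
from math import comb
from fractions import Fraction

def prob_ahead(n):
    """P(S_n > 0): strictly more wins than losses in n games, win probability 49/100."""
    numerator = sum(comb(n, w) * 49 ** w * 51 ** (n - w)
                    for w in range(n + 1) if 2 * w > n)
    return Fraction(numerator, 100 ** n)

def chebyshev_bound(n):
    """sigma^2 / (n eps^2) with eps = 1/50 and sigma^2 = 1 - (1/50)^2."""
    eps = Fraction(1, 50)
    var = 1 - eps ** 2
    return var / (n * eps ** 2)

for n in (100, 1000, 4000):
    print(n, float(prob_ahead(n)), float(chebyshev_bound(n)))
```

For moderate n the Chebyshev bound exceeds 1 and says nothing, while the exact probability is already visibly decreasing; the bound only becomes informative for n of the order of several thousand.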



4.2 The Central Limit Theorem 

The law of large numbers tells us that the average of a number of random variables is in some sense close to the expectation of any of the individual random variables. But how close is close? All we know from the law of large numbers is that when we divide $S_n = X_1 + \cdots + X_n$ by n, the result will, with high probability at least, be near $E(X_1)$. But dividing by n is in some sense quite coarse. Deviations which are typically smaller than n cannot be seen anymore after dividing by n. For instance, for all we know, $S_n$ could be near $nE(X_1) \pm \log n$ or near $nE(X_1) \pm \sqrt{n}$ or ...

The central limit theorem will tell us that $S_n$ will typically deviate by a multiplicative factor of $\sqrt{n}$ from the expectation. This will be true for any distribution of the random variables $X_1, \ldots, X_n$, as long as they have finite variance. In this section, we will state and prove a special case of this theorem in the case that the $X_i$'s take only the values 1 and $-1$ with equal probability. For this special case we can give a proof based on counting, very much in the same spirit as the proof of the arc-sine law in the previous chapter.

Theorem 4.2.1. (Central limit theorem) Let $X_1, X_2, \ldots, X_n$ be independent random variables with the same distribution, taking the values $\pm 1$ with equal probability. Let $S_n = X_1 + \cdots + X_n$ be their sum. Then, for any $a < b$ we have

\[ P\left( a \le \frac{S_n}{\sqrt{n}} \le b \right) \to \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx, \]

as $n \to \infty$.
Proof. We have seen already in Chapter 3 that

\[ P(S_{2n} = 2k) = \binom{2n}{n+k} 2^{-2n}. \tag{4.2} \]

As in the proof of the arc-sine law, we will now use Stirling's formula to estimate this probability. It will become clear from this where the somewhat magic factor $\sqrt{n}$ in the statement of the theorem comes from. Recall that Stirling's formula says that

\[ n! \sim n^n e^{-n} \sqrt{2\pi n}, \]

as $n \to \infty$. Applying this to (4.2) leads to

\begin{align*}
\binom{2n}{n+k} 2^{-2n}
 &= 2^{-2n} \frac{(2n)!}{(n+k)!\,(n-k)!} \\
 &\sim 2^{-2n} \frac{(2n)^{2n}}{(n+k)^{n+k} (n-k)^{n-k}} \cdot \frac{(2\pi(2n))^{1/2}}{(2\pi(n+k))^{1/2} (2\pi(n-k))^{1/2}} \\
 &= \left(\frac{n}{n+k}\right)^{n+k} \left(\frac{n}{n-k}\right)^{n-k} \left(\frac{n}{n+k}\right)^{1/2} \left(\frac{n}{n-k}\right)^{1/2} (\pi n)^{-1/2} \\
 &= \left(1 + \frac{k}{n}\right)^{-n-k} \left(1 - \frac{k}{n}\right)^{-n+k} \left(1 + \frac{k}{n}\right)^{-1/2} \left(1 - \frac{k}{n}\right)^{-1/2} (\pi n)^{-1/2} \\
 &= \left(1 - \frac{k^2}{n^2}\right)^{-n} \left(1 + \frac{k}{n}\right)^{-k} \left(1 - \frac{k}{n}\right)^{k} \left(1 + \frac{k}{n}\right)^{-1/2} \left(1 - \frac{k}{n}\right)^{-1/2} (\pi n)^{-1/2}.
\end{align*}

This somewhat boring computation does in fact bring us to an important observation. At this point we need to use one of the standard limits from calculus, namely the fact that

\[ \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n = e^x, \]

for all $x \in \mathbb{R}$. It is probably not immediately clear how we can use this in the computation above. In fact, we can only do this for a special choice of k, and we shall now choose a k which depends on n. For a fixed $x \in \mathbb{R}$ we choose 2k to be the nearest integer to $x\sqrt{2n}$. This means that $|2k - x\sqrt{2n}|$ is at most 1. If we pretend that 2k is exactly equal to $x\sqrt{2n}$, we see that the factor $(1 - k^2/n^2)^{-n}$ can be written as
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as n ^ oo. Similarly, we find 



1 + - 
n 



= 1 + 



—x\nl2 



= 1 + 



/ 

xV2 






as n 00 . A similar computation shows that also 

k 



1 



fc 






Exercise 4.2.2. Give a proof of this last statement.

Finally, when $|2k - x\sqrt{2n}| \le 1$, the terms $(1 + k/n)^{-1/2}$ and $(1 - k/n)^{-1/2}$
both converge to 1, as $n \to \infty$. Putting all these things together then gives

$$P(S_{2n} = 2k) \sim (\pi n)^{-1/2} e^{-x^2/2},$$

when we let $n \to \infty$, and choose $k$ according to $2k = x\sqrt{2n}$. At this stage of the
proof, the magic $\sqrt{n}$ in the statement of the theorem becomes somewhat clearer. It
is only when we choose $k$ and $n$ related via $2k = x\sqrt{2n}$ that we obtain a non-trivial
limit here. In some sense, the computation forces us to make this choice.

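Apparently, $S_{2n}$ typically takes values of the order $\sqrt{2n}$. So we are now
motivated to compute

$$P(a\sqrt{2n} \le S_{2n} \le b\sqrt{2n}),$$

for any $a < b$. To this end we write

$$\begin{aligned}
P(a\sqrt{2n} \le S_{2n} \le b\sqrt{2n}) &= \sum_{m:\, a\sqrt{2n} \le 2m \le b\sqrt{2n}} P(S_{2n} = 2m) \\
&= \sum_{m:\, a \le 2m/\sqrt{2n} \le b} P(S_{2n} = 2m) \\
&\sim \sum_{a \le x \le b:\, x \in 2\mathbb{Z}/\sqrt{2n}} (\pi n)^{-1/2} e^{-x^2/2},
\end{aligned}$$

where the last step comes from the substitution $2m = x\sqrt{2n}$ and where
$2\mathbb{Z}/\sqrt{2n}$ is the set $\{2z/\sqrt{2n} : z \in \mathbb{Z}\}$. Now the last sum is a Riemann sum with
step size $(2/n)^{1/2}$. So we rewrite the last sum as

$$\sum_{a \le x \le b:\, x \in 2\mathbb{Z}/\sqrt{2n}} (2\pi)^{-1/2} e^{-x^2/2} (2/n)^{1/2}.$$

It is now clear that when $n \to \infty$, this converges to

$$\frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx,$$

which is what we wanted to prove. □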
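
The convergence in the proof can also be observed numerically. The following Python sketch is not part of the proof: it sums formula (4.2) exactly and compares the result with the normal integral. The function names `prob_s2n_between`, `normal_integral` and `local_approximation` are our own choices for this illustration, and the integral is evaluated with the error function `math.erf` from the standard library rather than by a Riemann sum.

```python
import math

def prob_s2n_between(n: int, a: float, b: float) -> float:
    """Exact P(a*sqrt(2n) <= S_2n <= b*sqrt(2n)), summing formula (4.2) over k."""
    scale = math.sqrt(2 * n)
    total = 0.0
    for k in range(-n, n + 1):
        if a * scale <= 2 * k <= b * scale:
            # P(S_2n = 2k) = C(2n, n+k) * 2^(-2n)
            total += math.comb(2 * n, n + k) / 4 ** n
    return total

def normal_integral(a: float, b: float) -> float:
    """(2*pi)^(-1/2) times the integral of exp(-x^2/2) over [a, b], via math.erf."""
    return 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

def local_approximation(n: int, k: int) -> float:
    """The approximation (pi*n)^(-1/2) * exp(-x^2/2) with x = 2k / sqrt(2n)."""
    x = 2 * k / math.sqrt(2 * n)
    return math.exp(-x * x / 2) / math.sqrt(math.pi * n)

if __name__ == "__main__":
    # The exact probabilities should approach the normal integral as n grows.
    for n in (10, 100, 1000):
        print(n, prob_s2n_between(n, -1.0, 1.0), normal_integral(-1.0, 1.0))
```

For moderate $n$ the two numbers differ noticeably, because the sum runs over a lattice with step size $(2/n)^{1/2}$; the difference shrinks as this step size tends to 0.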

Exercise 4.2.3. The above proof is valid for even $n$. Extend the result to all $n$.



4.3 Exercises 

Exercise 4.3.1. Consider the following situation. We have two collections of random
variables $X_i$ and $Y_i$. The $X_i$'s are independent and identically distributed with
expectation 1, and the $Y_i$'s are also independent and identically distributed with
expectation 2. Now we first flip a coin once, and we define, for all $n$, a random
variable $S_n$ by $S_n = \sum_{i=1}^n X_i$ if heads comes up, and $S_n = \sum_{i=1}^n Y_i$ if tails comes
up. Show that

(a) $E(S_n/n) = 3/2$.

(b) $P(|S_n/n - 3/2| > 1/4)$ does not converge to 0, when $n \to \infty$.

(c) Why does (b) not contradict the law of large numbers? 

Exercise 4.3.2. (Local central limit theorem) Let $S_n$ have a Poisson distribution
with parameter $n$. Use Stirling's formula to show that when $(k - n)/\sqrt{n} \to x$, then

$$\sqrt{2\pi n}\, P(S_n = k) \to e^{-x^2/2}.$$




Intermezzo 



So far we have been dealing with countable sample spaces. The reason for this, as 
mentioned before, is that we can develop many interesting probabilistic notions 
without needing too many technical details. However, there are a number of in- 
tuitive probabilistic ideas which can not be studied using only countable sample 
spaces. Perhaps the most obvious kind of things which can not be captured in the 
countable setting are those involving an infinitely fine operation, such as choosing 
a random point from a line segment. In this Intermezzo we see why this is so, and 
observe some problems that arise from this fact. After the Intermezzo, we continue 
with the study of continuous probability theory, which does capture such infinitely 
fine operations. 



I.1 Uncountable Sample Spaces

When the sample space is countable, any event A clearly has only countably many 
elements. Hence the sum 



$$P(A) = \sum_{\omega \in A} p(\omega), \qquad (1.3)$$



which describes the probability of an event A, has a definite meaning. This is no 
longer true when the sample space is not countable, since in that case we cannot 
list all elements, and we often cannot write down a sum as in (1.3). 

Perhaps you wonder what kind of sample spaces are not countable. Here is 
a major example. 

Example I.1.1. (Picking a random point from a line segment) Suppose we want
to make a model which describes the idea of picking a point randomly from the
unit interval [0, 1]. The natural sample space is simply $\Omega = [0, 1]$, and it is clear
that this is an infinite space. Is $\Omega$ countable? The answer is no, and this can be
seen as follows.
All points $\omega \in \Omega$ have a so-called binary expansion. This expansion is defined
as follows. We can write

$$\omega = \sum_{n=1}^{\infty} \frac{d_n(\omega)}{2^n},$$

which we can also write as

$$\omega = .d_1(\omega) d_2(\omega) \cdots,$$

where $d_n(\omega)$ is 0 or 1. For some $\omega$, this expansion is not unique, for instance,
$1/2 = .1000\ldots$ and $1/2 = .0111\ldots$. But for all $\omega$'s which are not a multiple of $2^{-n}$
for some $n$, there is a unique binary expansion.

Now suppose that [0, 1] is countable. Each $\omega \in [0, 1]$ has a binary expansion,
and some will have two such expansions, as observed above. Hence the collection
of binary expansions arising from [0, 1] is countable: we simply consider the elements
of [0, 1] one by one, and write down the one or two corresponding binary
expansions.

Let us denote the collection of binary expansions by $\omega^1, \omega^2, \ldots$, so that each
$\omega^i$ is in fact an infinite sequence of 0s and 1s:

$$\omega^i = (\omega^i_1, \omega^i_2, \omega^i_3, \ldots).$$

Now we define a special sequence $\omega^*$ as follows:

$$\omega^* = (1 - \omega^1_1, 1 - \omega^2_2, 1 - \omega^3_3, \ldots).$$

This looks perhaps a little weird, but the only thing we do is to take the $n$th
element of $\omega^n$ and put 1 minus this $n$th element on the $n$th position of $\omega^*$. Now
have a careful look at the newly constructed $\omega^*$. Since $\omega^*_1 = 1 - \omega^1_1 \neq \omega^1_1$, it is
clear that $\omega^* \neq \omega^1$. Actually, since $\omega^*_n = 1 - \omega^n_n \neq \omega^n_n$, we see that the $n$th coordinates
of $\omega^*$ and $\omega^n$ are different. Hence $\omega^* \neq \omega^n$ for all $n = 1, 2, \ldots$. But now we have
reached a contradiction: we have found a binary expansion which is not one of our
original $\omega^n$'s. Hence our original assumption is wrong, and we conclude that [0, 1]
is not countable. □
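
The diagonal construction can be mimicked in a few lines of Python. This is only an illustration under our own conventions: a binary sequence is represented as a function from positions $n = 1, 2, \ldots$ to digits, and `example_listing` is a hypothetical listing invented for this sketch, not an enumeration of [0, 1] (which, as we have just seen, cannot exist).

```python
from typing import Callable

# A binary sequence maps a position n = 1, 2, ... to a digit 0 or 1.
BinarySeq = Callable[[int], int]

def diagonal(listing: Callable[[int], BinarySeq]) -> BinarySeq:
    """Return omega*, whose n-th digit is 1 minus the n-th digit of the n-th listed sequence."""
    return lambda n: 1 - listing(n)(n)

def example_listing(i: int) -> BinarySeq:
    """Hypothetical listing: the i-th sequence has digit 1 exactly at the multiples of i."""
    return lambda n: 1 if n % i == 0 else 0

omega_star = diagonal(example_listing)

if __name__ == "__main__":
    # omega* differs from the n-th listed sequence in position n, for every n we try.
    print(all(omega_star(n) != example_listing(n)(n) for n in range(1, 101)))
```

Whatever listing is supplied, `diagonal` returns a sequence that disagrees with the $n$th listed sequence at position $n$, which is exactly the step used in the argument above.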

A set which is not countable will be said to be uncountable. Hence we have 
proved that [0, 1] is uncountable. This means that the theory of the preceding 
chapters can not be used. In the next section we shall see that this has serious 
consequences. 



I.2 An Event Without a Probability?!

Perhaps the fact that certain sample spaces are not countable will not really ter- 
rify you. Of course, we can no longer use the idea of expressing probabilities as 
the sum of the appropriate probability masses, but perhaps there is a better, or 
more general way of defining and computing probabilities. For instance, when you
pick a random point from a line segment, you could start with the idea that the 
probability that this point ends up in a given interval, should depend only on the 
length of this interval. 

As an example, consider picking a point randomly from [0, 1]. The probability
that the chosen point lies in $[0, \frac{1}{2}]$ should be $\frac{1}{2}$. Indeed, the probability that the
point ends up in any given interval of length $r$ should be $r$.

More generally, one could hope for some way of assigning a 'length', 'area'
or 'volume', $V(A)$ say, to any subset $A$ of a sample space $\Omega$, and then defining the
probability of $A$ as

$$P(A) = \frac{V(A)}{V(\Omega)}.$$



The goal of the current section is to convince you that there is, in fact, a serious 
and fundamental problem with this approach. The following striking example tells 
us that there is no hope that such an approach could work without problems. 



Example I.2.1. (Banach-Tarski paradox) Suppose that we want to select a random
point from the surface of the unit sphere in three dimensions. We have no
machinery yet to define a probability measure in this context, but it seems reasonable
to do the following. For any subset $A$ of the surface of the unit sphere, we
would like to have the property that the probability that the chosen point is in 
A should be proportional to the area of A. In other words, if the set A becomes 
twice as big, the probability of finding our chosen point in A should also become 
twice as big. 

Agreeing on the reasonableness of this idea, we now turn to the Polish mathematicians
Banach and Tarski. They constructed a most remarkable subset of the
surface of the unit sphere. This subset, we call it $A$, has the following property.
One can find rotations A' and A" of A on the surface of the sphere, in such a way 
that A, A' and A" are pairwise disjoint, and in addition, such that A U A' U A" 
consists of the whole surface. Since the areas of A, A' and A" are all the same, it 
seems very reasonable, and inescapable, to assign probability 1/3 to each of the 
sets. In particular, the probability that our chosen point falls in A should be equal 
to 1/3. 

So far so good: sets $A$, $A'$ and $A''$ with these properties are easy to find. But
now comes the unbelievable fact. It is also possible to find rotations $B$, $B'$ and
$B''$ of the same set $A$, in such a way that the collection $A$, $B$, $B'$, $B''$ is pairwise
disjoint and together covers the whole surface! Since these four sets are again
rotations of each other, they should all have the same probability, and hence this 
probability must now be 1/4. In particular, the probability that our chosen point 
falls in A should be 1/4. This contradicts our earlier observation, and we are in 
deep trouble. □ 

It seems almost magic that a set $A$ as in Example I.2.1 can possibly exist,
but yet, it does. This very counterintuitive example shows that it may, after all,
not be so easy to assign probabilities to all subsets of a given sample space. In
the example, it is impossible to assign a probability to A if we insist on rotation 
invariance; the set A has no well-defined area. 

What to do? We shall resolve this problem by insisting that the collection of
events which receive a probability has a certain structure; not all sets will receive
a probability. This is the only possible solution: we have to restrict the collection
of events.



I.3 Random Variables on Uncountable Sample Spaces

We end this Intermezzo with some more motivation for the next chapter. We have 
noted in this Intermezzo that very natural sample spaces can be uncountable, and 
that in such an uncountable sample space, certain strange subsets can exist to 
which we can not assign a probability. This leads to serious complications if we 
want to define random variables on such sample spaces. As in Chapter 2, we would 
like to define a random variable X as a function 

$$X : \Omega \to \mathbb{R}$$

with the property that we can speak of the probability that $X$ takes a certain
value $a$, or the probability that $a \le X \le b$, for instance. Hence, we would like
to assign a well-defined probability to sets of the form

$$\{\omega \in \Omega : a \le X(\omega) \le b\}.$$

But how do we know that this set in fact can be given a well-defined probability? If
the sample space $\Omega$ is uncountable this is not always clear, as was illustrated with
the Banach-Tarski paradox. This is one of the fundamental issues in probability
theory.

The classical approach to attack this problem is to use measure theory to 
decide which sets receive a probability. But we are writing this book for students 
who haven’t studied measure theory. Therefore, we follow a different approach, and 
only assign probabilities to sets for which certain Riemann integrals over these sets 
exist. This approach has the drawback that we have to restrict our sample spaces
to $\mathbb{R}^d$, but in a first and second course in probability, this is no real restriction. At
the end of the book, in Chapter 9, we will sketch the more general framework of 
probability theory provided by measure theory. 




Chapter 5 

Continuous Random Variables 
and Vectors 



We have seen in the Intermezzo that there is a need to generalise the notions of 
an experiment and a random variable. In this chapter we suggest a setup which 
allows us to do this. As in the first two chapters, we first define experiments, and 
after that random variables. The theory in this chapter is built up similarly as the 
theory in the first two chapters. 



5.1 Experiments 

In the discrete theory, the sample space $\Omega$ was either finite or countably infinite.
In the present context, this no longer suffices and we take $\Omega = \mathbb{R}^d$, that is, our
experiments will have outcomes in $\mathbb{R}^d$, for some $d \in \mathbb{N}$.

An experiment in the discrete setting consisted of a sample space and a 
probability mass function, assigning a certain probability to each outcome in the 
sample space. The probability of an event A in the discrete setting could then 
simply be defined as the sum of all probabilities of elements in A. In the current 
continuous setting, this is impossible, but we can do something similar, replacing 
sums by appropriate integrals. Here is a first example of how this could work. 
After the example, we give the formal definitions. 

Example 5.1.1. (Choosing an arbitrary point from the unit interval) Suppose
we want to model the choice of a completely random point in the interval (0, 1).
There are uncountably many points in (0, 1), and hence we cannot list its elements.
Instead of concentrating on the event that the chosen point is equal to a certain
given element of (0, 1), we consider the event that the chosen point falls into a
subinterval $I \subseteq (0, 1)$. It is reasonable to assume that this probability should be
equal to the length of $I$. Writing $|I|$ for the length of an interval $I$, we therefore
define the probability of $I$ as

$$P(I) = |I|.$$

You should think of this number as the probability that a completely random
point in the unit interval ends up in $I$.

Another way of formulating this assignment of probabilities is the following.
Define the function

$$f(x) = \begin{cases} 1 & \text{if } 0 < x < 1, \\ 0 & \text{elsewhere.} \end{cases}$$

Now observe that

$$P(I) = \int_I f(x)\,dx,$$

that is, we have written the probability of an interval $I$ in terms of an integral.
When we write things this way, there is no need to restrict ourselves to intervals
contained in (0, 1). Indeed, since $f(x) = 0$ for all $x \notin (0, 1)$, the part of an interval
outside the unit interval does not contribute to the probability of the interval.

Formulated this way, we can also define the probability of other subsets of
$\mathbb{R}$. Indeed, we can define

$$P(A) = \int_A f(x)\,dx, \qquad (5.1)$$

as long as the integral $\int_A f(x)\,dx$ exists. This last requirement is necessary, since
for instance

$$\int_{\mathbb{Q}} f(x)\,dx$$

does not exist, and therefore $P(\mathbb{Q})$ remains undefined. So not all subsets of $\mathbb{R}$ have
a probability now, something we already anticipated in the Intermezzo. □

When you compare (5.1) to the corresponding formula

$$P(A) = \sum_{\omega \in A} p(\omega)$$

in the discrete theory, then you see that the two expressions are very similar. The
sum has been replaced by an integral, and the probability mass function has been
replaced by the function $f$.

In the discrete theory, a different experiment required a different probability
mass function. Here, in the current continuous setting, a different experiment
requires a different function $f$. This is illustrated by the next example.

Example 5.1.2. (Darts) Suppose we play darts on a big circular board with radius 
1 meter. Suppose that we are only interested in the distance between the arrow 
and the midpoint of this board, not in the exact position of the arrow. 

What is the probability that the distance between the arrow and the midpoint 
is at most t, for 0 < t < 1? To compute this, we assume that we hit the board in 
a completely random place and that we never miss the board. These assumptions
imply that the probability of hitting the board within the circle of radius $t$ is the
area of this smaller circle with radius $t$, divided by the area of the full circle with
radius 1. These areas are $\pi t^2$ and $\pi$ respectively, so the probability that the arrow
is less than $t$ away from the midpoint is

$$\frac{\pi t^2}{\pi} = t^2.$$

This means that the outcome of our experiment, that is, the distance from the
arrow to the midpoint, is between 0 and $t$ with probability $t^2$, and we can write
this in integral form as follows:

$$P((0, t)) = t^2 = \int_0^t 2x\,dx,$$

for all $0 < t \le 1$. It is reasonable now to define

$$g(x) = \begin{cases} 2x & \text{if } 0 < x < 1, \\ 0 & \text{elsewhere,} \end{cases}$$

and define

$$P(A) = \int_A g(x)\,dx,$$

for all $A \subseteq \mathbb{R}$ for which this integral is defined. □

In the previous example, we reduced the problem to one dimension by looking 
only at the distance between the arrow and the midpoint of the board. It is, 
however, also possible to study this example in two dimensions: 

Example 5.1.3. (Darts revisited) Consider the setup of the previous example, but 
now we are interested in the exact position of the arrow, not only in the distance 
to the midpoint. Then the probability that the arrow ends up in a given region A 
should be proportional to the area of A. We formalise this as follows. Since the 
area of the full circular board is $\pi$, we define

$$f(x, y) = \begin{cases} \frac{1}{\pi} & \text{if } x^2 + y^2 \le 1, \\ 0 & \text{elsewhere.} \end{cases}$$

Probabilities can now be defined as

$$P(A) = \iint_A f(x, y)\,dx\,dy,$$

for all $A$ for which this integral is defined. In fact, this assignment of probabilities
corresponds to the experiment of choosing an arbitrary point in the disc with
radius 1. □
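
The two darts examples can be connected by a small simulation. The sketch below is only an illustration, not a derivation: it draws points uniformly from the disc by rejection sampling (a technique we choose here for simplicity), and estimates the probability that the distance to the midpoint is at most $t$, which according to Example 5.1.2 should be close to $t^2$. The names `sample_distance` and `estimate_prob_within`, the number of trials and the seed are our own assumptions.

```python
import random

def sample_distance(rng: random.Random) -> float:
    """Distance to the midpoint of a uniformly random point in the unit disc."""
    while True:
        # Draw a point from the square [-1, 1]^2 and keep it only if it lies in the disc.
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        r2 = x * x + y * y
        if r2 <= 1.0:
            return r2 ** 0.5

def estimate_prob_within(t: float, trials: int = 100_000, seed: int = 0) -> float:
    """Fraction of simulated arrows that land within distance t of the midpoint."""
    rng = random.Random(seed)
    hits = sum(sample_distance(rng) <= t for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for t in (0.25, 0.5, 0.75):
        print(t, estimate_prob_within(t), t * t)
```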
We are almost ready now for the definition of an experiment in the current
continuous context. Before we give this definition, we first need to define the
analogue of probability mass functions in the continuous setting. This analogue is
called a probability density function and is defined as follows.

Definition 5.1.4. We call a non-negative function $f : \mathbb{R}^d \to \mathbb{R}$ a ($d$-dimensional)
probability density function if

$$\int_{a_1}^{b_1} \cdots \int_{a_d}^{b_d} f(x_1, \ldots, x_d)\,dx_d \cdots dx_1$$

exists for all $a_1 \le b_1, \ldots, a_d \le b_d$, and

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_d)\,dx_d \cdots dx_1 = 1.$$

Often, we will simply call this a density.

Definition 5.1.5. A ($d$-dimensional) experiment is a sample space $\Omega = \mathbb{R}^d$, together
with a ($d$-dimensional) density $f$. For any $A \subseteq \Omega$ for which

$$\int \cdots \int_A f(x_1, \ldots, x_d)\,dx_1 \cdots dx_d$$

is defined, we define the probability of $A$ by

$$P(A) = \int \cdots \int_A f(x_1, \ldots, x_d)\,dx_1 \cdots dx_d.$$

As before, $P$ is called a probability measure, and any subset of $\Omega$ that has a
probability is called an event.

Note that according to this definition, not all subsets are events.
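
For $d = 1$ and $A$ an interval, the probability in Definition 5.1.5 can be approximated by a Riemann sum. The following sketch does this with a midpoint rule; it is meant only to make the definition concrete, and the function name `riemann_probability` and the number of steps are our own choices. The two densities are those of Examples 5.1.1 and 5.1.2.

```python
from typing import Callable

def riemann_probability(density: Callable[[float], float],
                        a: float, b: float, steps: int = 100_000) -> float:
    """Approximate P((a, b)), the integral of the density over (a, b), by a midpoint Riemann sum."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width

def uniform_density(x: float) -> float:
    """Density of Example 5.1.1: 1 on (0, 1) and 0 elsewhere."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def darts_density(x: float) -> float:
    """Density of Example 5.1.2: 2x on (0, 1) and 0 elsewhere."""
    return 2.0 * x if 0.0 < x < 1.0 else 0.0

if __name__ == "__main__":
    print(riemann_probability(uniform_density, 0.2, 0.5))   # length of the interval
    print(riemann_probability(uniform_density, -1.0, 0.5))  # part outside (0, 1) does not count
    print(riemann_probability(darts_density, 0.0, 0.5))     # should be close to 0.5 ** 2
```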

In principle, any positive function satisfying the defining properties of a den- 
sity gives rise to an experiment. However, not all experiments that arise in this 
way are meaningful from a probabilistic point of view. Typically, a density is only 
relevant if it can be associated with a ‘real’ probabilistic situation. For instance, 
we already saw that the density $f$ given by $f(x) = 1$ for $0 < x < 1$ and $f(x) = 0$
elsewhere, corresponds to choosing a random point from the unit interval. The 
densities in the other examples also had an obvious probabilistic interpretation. 

We conclude this section with one more example, the exponential density 
function. 



Example 5.1.6. (Waiting for an event to happen) Let $\lambda > 0$ and consider the
density function

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0, \\ 0 & \text{elsewhere.} \end{cases}$$

Since $\int_0^\infty \lambda e^{-\lambda x}\,dx = 1$, $f$ is indeed a density. This density is very often used to
model the waiting time for a certain event to happen, like the waiting time for the
next customer to arrive in a shop, or the waiting time for the next earthquake in
California. We now sketch the reason for this. The argument goes via a suitable
discretisation of the time axis, as follows.

Suppose we flip a coin at time epochs $\varepsilon, 2\varepsilon, \ldots$, and that $Y$ is the waiting
time for the first head. In the above terminology, the event we are waiting for is
the occurrence of heads. Then $Y/\varepsilon$ has a geometric distribution:

$$P(Y > k\varepsilon) = (1 - p)^k,$$

where $p$ is the probability of heads at each coin flip. Recall that $E(Y/\varepsilon) = p^{-1}$,
and therefore

$$E(Y) = \frac{\varepsilon}{p}.$$

Now fix a time $t > 0$, and let $\varepsilon \downarrow 0$. If we kept the same $p$, then clearly the
number of successes by time $t$ would go to infinity, since the number of trials is
roughly $t/\varepsilon$. So in order to compensate for the fact that the number of trials up
to time $t$ goes to infinity, we choose the success probability $p$ in such a way that
the expectation of $Y$ is independent of the discrete time steps $\varepsilon$. More precisely,
we choose $p$ so as to satisfy $p/\varepsilon = \lambda$. Then we can write

$$P(Y > t) = (1 - p)^{t/\varepsilon} = (1 - \lambda\varepsilon)^{t/\varepsilon} \to e^{-\lambda t},$$

as $\varepsilon \to 0$. Hence in the limit as $\varepsilon \to 0$, the probability that the waiting time for
the first success is at most $t$ is given by

$$1 - e^{-\lambda t}.$$

We can now set up an experiment (with outcome the waiting time) such that the
probability that this waiting time is at most $t$ is equal to $1 - e^{-\lambda t}$, for all $t > 0$.
A little thought reveals that we can do this by choosing the exponential density
function $f(x) = \lambda e^{-\lambda x}$, since

$$\int_0^t \lambda e^{-\lambda x}\,dx = 1 - e^{-\lambda t},$$

for all $t > 0$. The probability that the waiting time is between $a$ and $b$, for instance,
can now be computed as

$$\int_a^b \lambda e^{-\lambda x}\,dx.$$

□
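
The discretisation argument of this example can be checked numerically. The sketch below computes the probability that no head has appeared by time $t$ when the coin is flipped at epochs $\varepsilon, 2\varepsilon, \ldots$ with $p = \lambda\varepsilon$, and compares it with $e^{-\lambda t}$. The function names and the particular values of $t$, $\lambda$ and $\varepsilon$ are our own choices for illustration; the number of flips up to time $t$ is taken to be the integer part of $t/\varepsilon$.

```python
import math

def discrete_tail(t: float, lam: float, eps: float) -> float:
    """P(Y > t) when a coin with success probability p = lam * eps is flipped at eps, 2*eps, ..."""
    p = lam * eps
    flips = math.floor(t / eps)  # number of coin flips up to time t
    return (1.0 - p) ** flips

def exponential_tail(t: float, lam: float) -> float:
    """The limiting value exp(-lam * t)."""
    return math.exp(-lam * t)

if __name__ == "__main__":
    t, lam = 1.5, 2.0
    for eps in (0.1, 0.01, 0.001, 0.0001):
        print(eps, discrete_tail(t, lam, eps), exponential_tail(t, lam))
```

As $\varepsilon$ decreases, the printed values in the second column should approach the value in the third column.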



5.2 Properties of Probability Measures

In the current continuous setting, probabilities are expressed in terms of Riemann
integrals. It is perhaps a good idea to recall the dominated convergence Theorem
3.2.4 in the higher dimensional setting, since we will use this result a number of
times below.

Theorem 5.2.1. (Dominated convergence) Let $f, f_i$, $i = 1, 2, \ldots$ be positive integrable
functions on $\mathbb{R}^d$ with $f_i(x) \to f(x)$ as $i \to \infty$, for all $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$.
Suppose in addition that there exists a function $g$ with $\int \cdots \int_{\mathbb{R}^d} g(x)\,dx < \infty$ such
that $f_i(x) \le g(x)$, for all $x$ and all $i$. Then

$$\lim_{i \to \infty} \int \cdots \int_{\mathbb{R}^d} f_i(x)\,dx = \int \cdots \int_{\mathbb{R}^d} f(x)\,dx.$$

In the following lemma we collect some properties of probability measures.
In what follows, when we write $P(A)$, this implicitly means that the probability
of $A$ is defined, that is, $\int_A f(x)\,dx$ exists for the appropriate density $f$. Recall the
definition of the indicator function of an event $E$:

$$1_E(x) = \begin{cases} 1 & \text{if } x \in E, \\ 0 & \text{if } x \notin E. \end{cases}$$


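Lemma 5.2.2. Consider an experiment with sample space $\Omega = \mathbb{R}^d$ and density $f$.

(a) For events $A_1, A_2, \ldots$ which are pairwise disjoint, we have

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

(b) $P(A^c) = 1 - P(A)$.

(c) If $A \subseteq B$, then $P(A) \le P(B)$.

(d) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

(e) $P(\Omega) = 1$ and $P(\emptyset) = 0$.

(f) For events $A_1 \subseteq A_2 \subseteq \cdots$ such that $A = \bigcup_{i=1}^{\infty} A_i$ is also an event, we have

$$P(A) = \lim_{i \to \infty} P(A_i).$$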


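Proof. For (a), observe that for disjoint events $A_1, \ldots, A_n$ we have

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i), \qquad (5.2)$$

as follows from elementary additivity properties of the Riemann integral. Next,
consider an infinite sequence $A_1, A_2, \ldots$ of disjoint events such that $\bigcup_{i=1}^{\infty} A_i$ is an
event. Then we have

$$\lim_{n \to \infty} P\left(\bigcup_{i=1}^{n} A_i\right) = \lim_{n \to \infty} \int_{\mathbb{R}^d} f(x) 1_{\bigcup_{i=1}^{n} A_i}(x)\,dx = \int_{\mathbb{R}^d} f(x) 1_{\bigcup_{i=1}^{\infty} A_i}(x)\,dx = P\left(\bigcup_{i=1}^{\infty} A_i\right),$$

where the one but last equality follows from Theorem 5.2.1. Hence the result
follows by sending $n$ to infinity on both sides of equation (5.2).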

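For (b), we write $P(A) = \int_A f(x)\,dx = \int_{\mathbb{R}^d} f(x) 1_A(x)\,dx$, and similarly
$P(A^c) = \int_{\mathbb{R}^d} f(x) 1_{A^c}(x)\,dx$. Hence,

$$\begin{aligned}
P(A) + P(A^c) &= \int_{\mathbb{R}^d} f(x) 1_A(x)\,dx + \int_{\mathbb{R}^d} f(x) 1_{A^c}(x)\,dx \\
&= \int_{\mathbb{R}^d} f(x) \left(1_A(x) + 1_{A^c}(x)\right)dx \\
&= \int_{\mathbb{R}^d} f(x)\,dx = 1.
\end{aligned}$$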



Properties (c)-(e) also follow from elementary properties of the Riemann integral; 
see the forthcoming Exercise 5.2.3. 

For (f) we apply Theorem 5.2.1 as follows. Define

$$h_i(x) = f(x) 1_{A_i}(x) \quad \text{and} \quad h(x) = f(x) 1_A(x).$$

The assumptions imply that

$$0 \le h_i(x) \uparrow h(x)$$

as $i \to \infty$. Now $\int_{\mathbb{R}^d} h(x)\,dx < \infty$ and hence, according to Theorem 5.2.1 we have

$$\int_{\mathbb{R}^d} h_i(x)\,dx \uparrow \int_{\mathbb{R}^d} h(x)\,dx.$$

The result now follows, since $\int_{\mathbb{R}^d} h_i(x)\,dx = P(A_i)$ and $\int_{\mathbb{R}^d} h(x)\,dx = P(A)$. □

Exercise 5.2.3. Give the details of the proof of (c)-(e). Write down exactly what
properties of the Riemann integral you use.



5.3 Continuous Random Variables

The motivation to introduce random variables is the same as in the discrete case: 
often, we are not really interested in the actual outcome of an experiment, but 
in some function of this outcome instead. We first give the formal definition of 
a (continuous) random variable, and illustrate this definition with a number of 
examples. Recall the fact that our sample spaces are $\Omega = \mathbb{R}^d$, for some integer $d$.

Definition 5.3.1. Consider an experiment with sample space $\Omega$ and probability
density function $f$.

(a) A random variable $X$ is a mapping from $\Omega$ to $\mathbb{R}$ such that the set

$$\{\omega \in \Omega : a \le X(\omega) \le b\}$$

is an event, for all $-\infty \le a \le b \le \infty$, where any $\le$ can be replaced by $<$.

(b) A random variable $X$ is called a continuous random variable with density $g$ if
$g$ is a density function and

$$P(a \le X \le b) = P(\omega : a \le X(\omega) \le b) = \int_a^b g(x)\,dx, \qquad (5.3)$$

for all $-\infty \le a \le b \le \infty$.

(c) A random variable $X$ is called a discrete random variable if there is a countable
subset $C$ of $\mathbb{R}$ with $P(X \in C) = 1$.

Let us first make a number of remarks about Definition 5.3.1. 

1. The definition in (a) expresses the idea that we want to be able to talk about 
the probability that X takes values in intervals. So for instance, we want to 
be able to talk about the probability that X takes a value between 5 and 12. 
Elementary properties of the Riemann integral then imply that many other 
probabilities are also well defined, for instance the probability that X takes 
a value between either 6 and 7 or between 9 and 10. 

2. For a continuous random variable X, the probability that X lies between 
a and b is specified through an integral. This implies that the probability 
that X takes a value in a certain interval does not change when we include 
or exclude endpoints of the interval. It also implies that for a continuous 
random variable $X$, $P(X = a) = 0$ for all $a$.

3. Note that a continuous random variable X does not have a unique density. 
For example, if we have a density of X, then we can change this density in 
a single point to obtain another density of X. We come back to this issue in 
a moment, after we have defined distribution functions. 

4. Perhaps you wonder why continuous random variables with a given density 
$g$ exist. In fact, this is easy to see. If we start with an experiment $\Omega = \mathbb{R}$
with density $g$ and define $X$ through

$$X(\omega) = \omega,$$

then X is a continuous random variable with density g. This construction 
is often used, but we shall see soon that interesting random variables are 
certainly not always constructed like this. 

Definition 5.3.2. The function 



$$F_X(x) = P(X \le x)$$



is called the distribution function of the random variable X. 

The following lemma implies that in order to show that a certain random 
variable is continuous, it suffices to show that its distribution function has the 
right form. The point of the lemma is that we do not assume from the outset that 
X is a continuous random variable. Rather, this follows as a conclusion. 

Lemma 5.3.3. Suppose that $X$ is a random variable whose distribution function
can be written as

$$F_X(x) = \int_{-\infty}^{x} f(y)\,dy,$$

for some density $f$. Then $X$ is a continuous random variable with density $f$.

Proof. First we prove that $P(X = a) = 0$ for all $a$. To see this, note that by
Lemma 5.2.2(a) we have that

$$P(X \le a) = P(X < a) + P(X = a),$$

and therefore,

$$P(X = a) = P(X \le a) - P(X < a). \qquad (5.4)$$

Since

$$\{X < a\} = \bigcup_{n=1}^{\infty} \left\{X \le a - \tfrac{1}{n}\right\},$$

it follows from Lemma 5.2.2(f) that

$$\begin{aligned}
P(X < a) &= \lim_{n \to \infty} P(X \le a - 1/n) \\
&= \lim_{n \to \infty} F_X(a - 1/n) \\
&= F_X(a) = P(X \le a),
\end{aligned}$$

where the one but last equality follows since $F_X$ is a continuous function by
assumption. It now follows from (5.4) that $P(X = a) = 0$.

Let $a < b$. Since we can write

$$\{\omega : X(\omega) \le b\} = \{\omega : X(\omega) \le a\} \cup \{\omega : a < X(\omega) \le b\},$$

a disjoint union, we have

$$P(X \le b) = P(X \le a) + P(a < X \le b).$$

This leads to

$$\begin{aligned}
P(a < X \le b) &= P(X \le b) - P(X \le a) \\
&= \int_{-\infty}^{b} f(x)\,dx - \int_{-\infty}^{a} f(x)\,dx \\
&= \int_a^b f(x)\,dx,
\end{aligned}$$

as required. Since $P(X = a) = 0$ for all $a$, we can interchange $<$ and $\le$ signs, and
we have proved the lemma. □

Asking for the distribution of a continuous random variable is asking for 
either its distribution function or its density. 

Let $X$ be a continuous random variable with distribution function $F_X$, and
suppose that $F_X$ is differentiable on $(a, b)$. It follows from the fundamental theorem
of calculus that

$$F_X(b) - F_X(a) = \int_a^b \frac{d}{dt} F_X(t)\,dt.$$

Since

$$P(a \le X \le b) = F_X(b) - F_X(a),$$

this shows that $f(t) = \frac{d}{dt} F_X(t)$ is a density of $X$ on $(a, b)$. Our distribution functions
will always be differentiable with at most finitely many exceptional points.
Hence we can assume that the density $f_X$ and the distribution function $F_X$ are
related via

$$f_X(x) = \frac{d}{dx} F_X(x), \qquad (5.5)$$

for all $x$, with at most finitely many exceptions. Here follows a number of examples.
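
Before turning to the examples, relation (5.5) can be illustrated numerically. The sketch below approximates the derivative of a distribution function by a central difference; the function names and the step size `h` are our own choices. As a test case we use the distribution function of the distance in the darts Example 5.1.2, which equals $t^2$ for $0 < t < 1$, so that its derivative should be close to the density $2t$ found there.

```python
from typing import Callable

def density_from_cdf(cdf: Callable[[float], float], x: float, h: float = 1e-5) -> float:
    """Central-difference approximation of d/dx F_X(x), as in relation (5.5)."""
    return (cdf(x + h) - cdf(x - h)) / (2.0 * h)

def darts_cdf(t: float) -> float:
    """Distribution function of the distance in Example 5.1.2: 0, t^2 or 1."""
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return t * t

if __name__ == "__main__":
    for x in (0.1, 0.3, 0.9, 1.5):
        print(x, density_from_cdf(darts_cdf, x))
```

At the exceptional point $x = 1$ the distribution function is not differentiable, and the central difference returns a value that depends on `h` and has no meaning as a density.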

Example 5.3.4. (Uniform distribution) The random variable $X$ is said to have a
uniform distribution on $[a, b]$ if its density is given by

$$f(x) = \frac{1}{b - a},$$

for $a \le x \le b$, and $f(x) = 0$ otherwise. A random variable with a uniform distribution
on $[a, b]$ can be interpreted as the outcome of a completely random point
from $[a, b]$. (The interval $[a, b]$ need not be closed. If $X$ has a uniform distribution
on $(a, b)$, then $f(x) = 1/(b - a)$ for all $x \in (a, b)$. This difference is of little
importance.)

The distribution function of $X$ is given by

$$F_X(x) = \begin{cases} 0 & \text{if } x < a, \\ \frac{x - a}{b - a} & \text{if } a \le x \le b, \\ 1 & \text{if } x > b. \end{cases}$$

To see how we can compute probabilities now, take for instance $a = 0$ and
$b = 2$, so that $X$ has a uniform distribution on $(0, 2)$. We compute $P(X > \frac{3}{2})$ as
follows:

$$P\left(X > \tfrac{3}{2}\right) = \int_{3/2}^{2} f(x)\,dx = \int_{3/2}^{2} \tfrac{1}{2}\,dx = \tfrac{1}{4}.$$

□

Exercise 5.3.5. Check that $f_X(x) = \frac{d}{dx} F_X(x)$ for all but two values of $x$ in this
example.

Example 5.3.6. (Normal distribution) The random variable $X$ has a standard
normal distribution if its density is given by

$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2},$$

for all $x \in \mathbb{R}$. The first thing to check is that this is indeed a density. The integrand
cannot be integrated directly, but there is a nice trick from calculus which enables
us to compute $\int_{-\infty}^{\infty} f(x)\,dx$. The trick consists of computing the product of two
such integrals, and then using polar coordinates:

$$\left(\int_{-\infty}^{\infty} f(x)\,dx\right)^2 = \frac{1}{2\pi} \int_{y=-\infty}^{\infty} \int_{x=-\infty}^{\infty} e^{-\frac{1}{2}(x^2 + y^2)}\,dx\,dy = \frac{1}{2\pi} \int_{\theta=0}^{2\pi} \int_{r=0}^{\infty} r e^{-\frac{1}{2}r^2}\,dr\,d\theta = 1.$$

The random variable $X$ is said to have a normal distribution with parameters
$\mu \in \mathbb{R}$ and $\sigma^2 > 0$ if

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2}},$$

for all $x \in \mathbb{R}$. Note that for $\mu = 0$ and $\sigma^2 = 1$, this reduces to the density of a
standard normally distributed random variable.



Exercise 5.3.7. Show that this is a density, using the fact that the density of
the standard normal distribution integrates to 1.

The reason that the normal distribution is very important already became
apparent in the central limit Theorem 4.2.1. Although we did not call it by name
at that moment, the limit really is expressed as a probability in terms of a standard
normal distribution. In the next chapter, a much more general statement will be
proved, again involving the normal distribution. □
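
That the normal density integrates to 1 can also be made plausible numerically. The sketch below is no substitute for the polar-coordinates argument or for Exercise 5.3.7: it approximates the integral by a midpoint Riemann sum over a large but finite interval, on the assumption that the mass outside that interval is negligible. The function names, the integration ranges and the parameter values $\mu = 1$, $\sigma^2 = 4$ are our own choices.

```python
import math
from typing import Callable

def normal_density(x: float, mu: float = 0.0, sigma2: float = 1.0) -> float:
    """Density of the normal distribution with parameters mu and sigma2."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2.0 * math.pi * sigma2)

def integrate(density: Callable[[float], float],
              lower: float, upper: float, steps: int = 20_000) -> float:
    """Midpoint Riemann sum of the density over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(density(lower + (i + 0.5) * width) for i in range(steps)) * width

# Standard normal density on [-10, 10]; the tails beyond this range are negligible.
standard_total = integrate(normal_density, -10.0, 10.0)

# Normal density with mu = 1 and sigma2 = 4 on a correspondingly wider range.
shifted_total = integrate(lambda x: normal_density(x, mu=1.0, sigma2=4.0), -30.0, 30.0)

if __name__ == "__main__":
    print(standard_total, shifted_total)
```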

Example 5.3.8. (Exponential distribution) The random variable $X$ has an exponential
distribution with parameter $\lambda > 0$ if its density is given by

$$f(x) = \lambda e^{-\lambda x},$$

whenever $x \ge 0$, and $f(x) = 0$ for $x < 0$. We have come across this density already
in Example 5.1.6, where it naturally appeared as the density corresponding to
the waiting time for an unpredictable event. In that example, we first derived the
distribution function of the waiting time. Indeed, we showed that in the limit for
$\varepsilon \to 0$, the probability that the waiting time is at most $t$ is equal to $1 - e^{-\lambda t}$.
Hence, the distribution function of the waiting time is $F(t) = 1 - e^{-\lambda t}$, for $t > 0$.

When we differentiate this, we find the density of the exponential distribution, in
agreement with (5.5). □

Example 5.3.9. (Cauchy distribution) The random variable $X$ has a Cauchy
distribution if its density is given by

\[ f(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}, \]

for all $x \in \mathbb{R}$. Here is a fairly natural example where the Cauchy distribution
arises.

Consider the sample space $\mathbb{R}$ with a uniform density on $(-\frac{\pi}{2}, \frac{\pi}{2})$. We select
a point $\Theta$ in this sample space, and construct a random variable $X$ as follows. We
draw a half-line in the plane, starting in $(0, 1)$ in direction $\Theta$ downwards. More
precisely, $\Theta$ is the angle between the line and the $y$-axis, where $\Theta$ itself is uniformly
distributed on $(-\frac{\pi}{2}, \frac{\pi}{2})$.

Denote the intersection of the line with the $x$-axis by $(X, 0)$. A little thought
reveals that $X = \tan\Theta$. We claim that $X$ is a continuous random variable with a
Cauchy distribution. To see this, first observe that for any $-\pi/2 < a < \pi/2$, we
have

\[ P(\Theta \le a) = \frac{a - (-\pi/2)}{\pi} = \frac{1}{2} + \frac{a}{\pi}. \]

Hence,

\[ P(X \le x) = P(\tan\Theta \le x) = P(\Theta \le \arctan x) = \frac{1}{2} + \frac{1}{\pi}\arctan x, \]


and differentiation now leads to the desired result. This is an example of a random
variable $X$ which is not the identity map. □

Example 5.3.10. Consider the darts Examples 5.1.2 and 5.1.3. We can reformulate
this now in terms of continuous random variables. Indeed, the sample space in
Example 5.1.3 is $\Omega = \mathbb{R}^2$ and the density is $f(x,y) = 1/\pi$ when $x^2 + y^2 \le 1$, and
$f(x,y) = 0$ otherwise. We can define a random variable $X$ on $\Omega$ by

\[ X((x,y)) = \sqrt{x^2 + y^2}. \]

Then $X$ represents the distance to the midpoint, and we have seen in Example
5.1.2 that $X$ is a random variable with density $f(x) = 2x$ when $0 < x < 1$ and
$f(x) = 0$ elsewhere. In fact, in Example 5.1.2 we first obtained the distribution
function, and derived the density from it by differentiation, in agreement
with (5.5). □
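The construction in Example 5.3.9 can be checked numerically. The following is a minimal sketch, not part of the original text: it draws $\Theta$ uniformly on $(-\pi/2, \pi/2)$, sets $X = \tan\Theta$, and compares the empirical distribution function with $\frac{1}{2} + \frac{1}{\pi}\arctan x$. The function names, the sample size and the seed are illustrative assumptions.

```python
import math
import random


def cauchy_cdf(x):
    """Distribution function derived in Example 5.3.9: 1/2 + arctan(x)/pi."""
    return 0.5 + math.atan(x) / math.pi


def sample_x(rng):
    """Draw Theta uniformly on (-pi/2, pi/2) and return X = tan(Theta)."""
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    return math.tan(theta)


def empirical_cdf(x, n=100_000, seed=0):
    """Fraction of n simulated values of X that are at most x (n and seed are arbitrary choices)."""
    rng = random.Random(seed)
    return sum(sample_x(rng) <= x for _ in range(n)) / n


if __name__ == "__main__":
    for x in (-2.0, 0.0, 1.0, 5.0):
        print(x, round(empirical_cdf(x), 3), round(cauchy_cdf(x), 3))
```

For each $x$, the two printed values should be close; the agreement improves as the sample size grows.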



5.4 Expectation 

The concepts discussed in the first four chapters all have a counterpart here. 
When comparing results and definitions with the discrete theory, the general rule 
of thumb is that we replace probability mass functions by densities, and sums by 
integrals. For the expectation of a random variable, this leads to the following 
definition. We call the integral $\int_{-\infty}^{\infty} g(x)\,dx$ well defined if
(1) $\int_{-\infty}^{\infty} |g(x)|\,dx < \infty$, or (2) $g(x) \ge 0$ for all $x$, or (3) $g(x) \le 0$ for all $x$.

Definition 5.4.1. The expectation $E(X)$ of a random variable $X$ with density $f$ is
defined by

\[ E(X) = \int_{-\infty}^{+\infty} x f(x)\,dx, \]

whenever this integral is well defined.



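Maybe you are not convinced yet that this is the right definition of the
expectation. So let us explain why this definition is reasonable. For simplicity,
we assume that the continuous random variable $X$ is bounded and non-negative:
$0 \le X < K$ for some integer $K$. The argument goes via an approximation of $X$,
as follows.

Let, for any integer $n$, $X_n$ be defined as

\[ X_n = \frac{k}{n} \quad \text{if } \frac{k}{n} \le X < \frac{k+1}{n}, \]

which implies that $X_n$ is a discrete random variable. The expectation of $X_n$ can
be computed as follows:

\[
\begin{aligned}
E(X_n) &= \sum_{k=0}^{nK-1} \frac{k}{n}\, P\Big(\frac{k}{n} \le X < \frac{k+1}{n}\Big) \\
       &= \sum_{k=0}^{nK-1} \frac{k}{n} \int_{k/n}^{(k+1)/n} f_X(x)\,dx \\
       &= \sum_{k=0}^{nK-1} \int_{k/n}^{(k+1)/n} s_n(x) f_X(x)\,dx \\
       &= \int_0^K s_n(x) f_X(x)\,dx,
\end{aligned}
\]

where

\[ s_n(x) = \frac{k}{n} \quad \text{if } \frac{k}{n} \le x < \frac{k+1}{n}. \]

We then note that

\[
\begin{aligned}
\Big| E(X_n) - \int_{-\infty}^{\infty} x f_X(x)\,dx \Big|
  &= \Big| \int_{-\infty}^{\infty} \{s_n(x) - x\} f_X(x)\,dx \Big| \\
  &\le \int_{-\infty}^{\infty} |s_n(x) - x|\, f_X(x)\,dx \\
  &\le \frac{1}{n} \int_{-\infty}^{\infty} f_X(x)\,dx = \frac{1}{n},
\end{aligned}
\]

so that $E(X_n) \to \int_{-\infty}^{\infty} x f_X(x)\,dx$. Since $X_n \to X$ as $n \to \infty$ and
$E(X_n) \to \int_{-\infty}^{\infty} x f_X(x)\,dx$, it is reasonable to define $E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx$.

The expectation of a random variable can take the values $\pm\infty$; we shall see
examples of this.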
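The approximation argument above can be illustrated with a small computation. The sketch below is an illustration under stated assumptions rather than part of the original argument: it takes the density $f(x) = 2x$ on $[0,1]$ (so $K = 1$ and $\int x f(x)\,dx = 2/3$), computes $E(X_n)$ from the distribution function, and shows the gap to $2/3$ shrinking at rate at most $1/n$. The choice of density and the function names are hypothetical.

```python
def cdf(x):
    """Distribution function of the illustrative density f(x) = 2x on [0, 1]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return x * x


def expectation_of_discretisation(n, K=1):
    """E(X_n) = sum over k = 0, ..., nK - 1 of (k/n) * P(k/n <= X < (k+1)/n)."""
    return sum((k / n) * (cdf((k + 1) / n) - cdf(k / n)) for k in range(n * K))


EXACT = 2.0 / 3.0  # integral of x * 2x over [0, 1]

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        approx = expectation_of_discretisation(n)
        # The gap EXACT - approx is non-negative and bounded by 1/n.
        print(n, approx, EXACT - approx)
```

Because $s_n(x) \le x$, the approximation $E(X_n)$ always lies below the integral, which is why the printed gap is non-negative.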

Example 5.4.2. (Exponential distribution) For a random variable $X$ with an
exponential distribution with parameter $\lambda$ we can compute the expectation as follows:

\[ E(X) = \int_0^{\infty} \lambda x e^{-\lambda x}\,dx = 1/\lambda. \]

□
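For completeness, the integral in Example 5.4.2 can be evaluated by integration by parts. The intermediate step below is a standard calculus computation supplied here as an aid; it is not spelled out in the original text.

```latex
\[
\int_0^{\infty} \lambda x e^{-\lambda x}\,dx
  = \Big[ -x e^{-\lambda x} \Big]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx
  = 0 + \Big[ -\tfrac{1}{\lambda} e^{-\lambda x} \Big]_0^{\infty}
  = \frac{1}{\lambda}.
\]
```

The boundary term vanishes because $x e^{-\lambda x} \to 0$ as $x \to \infty$ for every $\lambda > 0$.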



Expectations of continuous random variables share the properties of their 
discrete counterparts. At this point, we only note the following. More properties 
follow in Section 5.8. 



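Theorem 5.4.3. When $E(X)$ exists, we have $E(aX + b) = aE(X) + b$, for all
$a, b \in \mathbb{R}$.

Proof. Since we have defined the expectation via the density, we need to compute
the density of $Y = aX + b$. To do this, we suppose first that $a > 0$ and write

\[
P(Y \le y) = P\Big(X \le \frac{y-b}{a}\Big)
  = \int_{-\infty}^{(y-b)/a} f_X(x)\,dx
  = \int_{-\infty}^{y} \frac{1}{a}\, f_X\Big(\frac{u-b}{a}\Big)\,du,
\]

by the substitution $u = ax + b$. Hence, the density of $Y$ evaluated at $u$ is given by
$\frac{1}{a} f_X\big(\frac{u-b}{a}\big)$. This implies that the expectation of $Y$ is equal to

\[
\begin{aligned}
E(aX + b) &= \int_{-\infty}^{\infty} \frac{u}{a}\, f_X\Big(\frac{u-b}{a}\Big)\,du \\
          &= \int_{-\infty}^{\infty} \frac{ax + b}{a}\, f_X(x)\, a\,dx \\
          &= \int_{-\infty}^{\infty} b f_X(x)\,dx + a \int_{-\infty}^{\infty} x f_X(x)\,dx \\
          &= b + aE(X),
\end{aligned}
\]

by the substitution $u - b = xa$. □

Exercise 5.4.4. Give the analogous proof for $a < 0$.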

Example 5.4.5. (Uniform distribution) Recall that a random variable $X$ is said to
have a uniform distribution on $[a, b]$ if its density is given by

\[ f(x) = \frac{1}{b-a} \]

for $a \le x \le b$, and $f(x) = 0$ otherwise. Its expectation is given by

\[ E(X) = \int_a^b \frac{x}{b-a}\,dx = \frac{a+b}{2}. \]

□

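Example 5.4.6. (Normal distribution) Recall that the random variable $X$ has a
standard normal distribution if its density is given by

\[ f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2} \]

for all $x \in \mathbb{R}$. It is easy to see, using the symmetry of $f$ around 0, that the
expectation of $X$ is equal to 0. The random variable $X$ was said to have a normal
distribution with parameters $\mu \in \mathbb{R}$ and $\sigma^2 > 0$ if

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}} \]

for all $x \in \mathbb{R}$. We can find the expectation of $X$ via the following computation,
writing $\sigma = \sqrt{\sigma^2}$. Let

\[ Y = \frac{X - \mu}{\sigma}. \]

We claim that $Y$ has a standard normal distribution. To see this, we can write

\[
\begin{aligned}
P(Y \le y) &= P(X \le y\sigma + \mu) \\
           &= \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{y\sigma + \mu} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\,dx \\
           &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-\frac{1}{2}v^2}\,dv,
\end{aligned}
\]

where the last step follows from the substitution $v = (x - \mu)/\sigma$. This shows that
$Y$ has a standard normal distribution, so that $E(Y) = 0$. Since $X = \sigma Y + \mu$, this
leads via Theorem 5.4.3 to $E(X) = \mu$. □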

Example 5.4.7. (Cauchy distribution) Recall that the random variable $X$ has a
Cauchy distribution if its density is given by

\[ f(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}, \]

for all $x \in \mathbb{R}$. We claim that $X$ has no expectation. The reason for this is that

\[ \int_{-\infty}^{\infty} \frac{1}{\pi}\,\frac{x}{1 + x^2}\,dx \]

does not exist. (Can you verify this?) □

We end this section with an envelope problem where continuous random 
variables play an important role. 

Example 5.4.8. (Second envelope problem) Suppose that I show you two envelopes,
both containing a certain amount of money. You have no information whatsoever
about the amounts of money in the envelopes. (This is different from the first
envelope problem in Example 2.5.10, where you had the information that one
envelope contained twice the amount of the other.) You choose one and this time
you do open the envelope. Suppose you see that the envelope contains $x$ euros,
say. After having seen this amount, I offer you the opportunity to swap, that is,
to choose the other envelope. Does it make sense to do so? More precisely, is
there a decision algorithm that enables you to end up with the highest amount of
money, with probability strictly larger than $\frac{1}{2}$? It is perhaps surprising that such
an algorithm does exist, and we describe it here now.

Suppose that the amounts in the two envelopes are $a$ and $b$, with $a < b$. For
the moment we assume that $a$ and $b$ are non-random. We denote the amount in
the chosen envelope by $X_1$. This means that $P(X_1 = a) = P(X_1 = b) = \frac{1}{2}$. The
algorithm makes use of an auxiliary random variable $Y$, which is independent of
$X_1$ and which has an exponential distribution with parameter 1, say. (In fact, the
precise distribution of $Y$ is of little importance; we will see in a moment what
properties this distribution should have for our purposes.) The algorithm runs
as follows. After looking at the amount $X_1$, we go to our computer, and draw a
realisation from $Y$. If $X_1 < Y$ then we swap; if $X_1 > Y$ then we don't. Quite
simple, isn't it? Why does this work? Let us denote the amount of money in
the final envelope (after possible swapping) by $X_2$. We can write down the joint
distribution of $(X_1, X_2)$. Indeed, $P(X_2 = b \mid X_1 = b) = P(Y < b) = 1 - e^{-b}$, and
$P(X_2 = b \mid X_1 = a) = P(Y > a) = e^{-a}$. This leads to

\[
\begin{aligned}
P(X_2 = b) &= P(X_2 = b \mid X_1 = b)P(X_1 = b) + P(X_2 = b \mid X_1 = a)P(X_1 = a) \\
           &= \tfrac{1}{2}(1 - e^{-b}) + \tfrac{1}{2} e^{-a} \\
           &= \tfrac{1}{2} + \tfrac{1}{2}(e^{-a} - e^{-b}) > \tfrac{1}{2}.
\end{aligned}
\]

This is a rather surprising conclusion. There is a winning strategy in this case. □
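The swapping rule of Example 5.4.8 is easy to simulate. The following sketch is illustrative only: it fixes two hypothetical amounts $a < b$, applies the rule "swap if and only if $X_1 < Y$" with $Y$ exponential with parameter 1, and compares the observed success frequency with $\frac{1}{2} + \frac{1}{2}(e^{-a} - e^{-b})$. The function names, the number of trials and the seed are assumptions made for the illustration.

```python
import math
import random


def exact_win_probability(a, b):
    """P(X2 = b) = 1/2 + (e^{-a} - e^{-b}) / 2 for fixed amounts a < b."""
    return 0.5 + 0.5 * (math.exp(-a) - math.exp(-b))


def simulate_win_probability(a, b, trials=200_000, seed=1):
    """Estimate the probability of ending up with the larger amount b."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        x1 = a if rng.random() < 0.5 else b   # amount in the chosen envelope
        y = rng.expovariate(1.0)              # auxiliary variable Y, parameter 1
        if x1 < y:                            # swap if and only if X1 < Y
            x2 = b if x1 == a else a
        else:
            x2 = x1
        wins += (x2 == b)
    return wins / trials


if __name__ == "__main__":
    a, b = 1.0, 2.0  # hypothetical amounts
    print(simulate_win_probability(a, b), exact_win_probability(a, b))
```

With amounts far out in the tail of the distribution of $Y$ (for example $a = 100$, $b = 200$), the advantage over $\frac{1}{2}$ is still strictly positive but numerically negligible, which is relevant for Exercise 5.4.10.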



Exercise 5.4.9. Investigate what happens when we assume that the amounts in
the two envelopes are themselves random variables.

Exercise 5.4.10. Suppose that we have no information at all about the amounts
in the two envelopes. In particular, we do not know whether the amounts are
random, independent, etcetera. What kind of distributions of $Y$ would be useful
for the decision making procedure?



5.5 Random Vectors and Independence 

As in the theory of discrete random variables, if we want to study the interaction 
between various random variables, we need to look at random vectors. 

Definition 5.5.1. Consider an experiment with sample space $\Omega$ and density $f$.

(a) A function $X = (X_1, \ldots, X_d)$ from $\Omega$ into $\mathbb{R}^d$ is called a random vector if

\[ \{\omega : a_i \le X_i(\omega) \le b_i,\ i = 1, \ldots, d\} \]

is an event, for all $-\infty \le a_i \le b_i \le \infty$.

(b) A random vector $X = (X_1, \ldots, X_d)$ is called a continuous random vector with
(joint) density $g$ if

\[
P(a_1 \le X_1 \le b_1, \ldots, a_d \le X_d \le b_d)
  = \int_{a_1}^{b_1} \cdots \int_{a_d}^{b_d} g(x_1, \ldots, x_d)\,dx_d \cdots dx_1 \qquad (5.6)
\]

for all $-\infty \le a_i \le b_i \le \infty$, $i = 1, \ldots, d$.

Definition 5.5.2. We define the (joint) distribution function of $X = (X_1, \ldots, X_d)$
by

\[ F_X(x_1, \ldots, x_d) = P(X_1 \le x_1, \ldots, X_d \le x_d). \]

We refer to the distribution of the vector as the joint distribution, and to
the distributions of its individual components as the marginal distributions. As in
the discrete case, knowing the joint distribution means also knowing the marginal
distribution of each of the individual $X_i$'s:



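Theorem 5.5.3. Let $X = (X_1, \ldots, X_d)$ have joint distribution function $F_X$ and
joint density $f_X$. Then $X_i$ is a continuous random variable with density

\[
f_{X_i}(x_i) = \int_{x_1=-\infty}^{\infty} \cdots \int_{x_{i-1}=-\infty}^{\infty} \int_{x_{i+1}=-\infty}^{\infty} \cdots \int_{x_d=-\infty}^{\infty}
  f_X(x_1, \ldots, x_d)\,dx_d \cdots dx_{i+1}\,dx_{i-1} \cdots dx_1.
\]

In words, we find the marginal density of $X_i$ by integrating out all other variables
from the joint density.

Proof. It follows from the definition that

\[
\begin{aligned}
P(X_i \le x) &= \int \cdots \int_{\{x_i \le x\}} f_X(x_1, \ldots, x_d)\,dx_1 \cdots dx_d \\
  &= \int_{-\infty}^{x} \Big( \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}
     f_X(x_1, \ldots, x_d)\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_d \Big)\,dx_i,
\end{aligned}
\]

from which the result follows, using Lemma 5.3.3. □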



Example 5.5.4. Let $(X, Y)$ have joint density

\[ f(x,y) = \frac{1}{y}\, e^{-\frac{x}{y} - y} \]

for $x, y > 0$, and $f(x,y) = 0$ otherwise. The marginal density of $Y$ can now be
found as follows:

\[
f_Y(y) = \int_{-\infty}^{\infty} f(x,y)\,dx = \int_0^{\infty} \frac{1}{y}\, e^{-\frac{x}{y} - y}\,dx = e^{-y}
\]

for all $y > 0$. We conclude that $Y$ has an exponential distribution with parameter 1. □

Example 5.5.5. Let $(X, Y)$ have joint density

\[ f(x,y) = \frac{1}{x}, \quad 0 < y \le x \le 1, \]

and $f(x,y) = 0$ elsewhere. We find the marginal density of $X$ by integrating out $y$:

\[ f_X(x) = \int_{y=0}^{x} f(x,y)\,dy = 1, \]

where $0 < x \le 1$. Hence $X$ has a uniform distribution on $(0, 1]$. The density of $Y$
is equal to

\[ f_Y(y) = \int_{x=y}^{1} f(x,y)\,dx = -\log y, \]

for $0 < y \le 1$. We have no particular name for this distribution. □
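As a quick sanity check on Example 5.5.5 (supplied here as an aid, not taken from the original text), one can verify that the marginal density $-\log y$ indeed integrates to 1 over $(0, 1]$:

```latex
\[
\int_0^1 (-\log y)\,dy = \Big[\, y - y\log y \,\Big]_0^1
  = 1 - \lim_{y \downarrow 0}\,(y - y\log y) = 1,
\]
```

where the limit is 0 because $y \log y \to 0$ as $y \downarrow 0$.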

Recall that discrete random variables $X$ and $Y$ were said to be independent
if

\[ P(X = x, Y = y) = P(X = x)P(Y = y). \qquad (5.7) \]

This definition is useless for continuous random variables. To see this, recall that
for any continuous random variable $X$ we have

\[ P(X = x) = 0, \]

for all $x$. Hence (5.7) says that $0 = 0$, and this is a useless condition.

In the continuous context, there are several (equivalent) ways to define
independence. We make a somewhat arbitrary choice, but note the forthcoming
Exercise 5.5.8. Compare the following definition to Exercise 2.2.4.

Definition 5.5.6. The random variables $X$ and $Y$, defined on the same sample
space, are said to be independent if for all $x$ and $y$,

\[ P(X \le x, Y \le y) = P(X \le x)P(Y \le y). \]

Exercise 5.5.7. State a suitable definition of independence for $n$ random
variables $X_1, \ldots, X_n$, defined on the same sample space.

Exercise 5.5.8. Show that the continuous random variables $X$ and $Y$ with joint
density $f$ are independent if and only if there are densities $f_X$ of $X$ and $f_Y$ of $Y$
so that

\[ f_X(x) f_Y(y) = f(x,y), \]

for all $x, y \in \mathbb{R}$.

Exercise 5.5.9. Show that the continuous random variables $X$ and $Y$ are
independent if and only if they have a joint density $f(x,y)$ which can be written as
a product $f(x,y) = g(x)h(y)$ of a function of $x$ alone and a function of $y$ alone.
(We say, as before, that $f$ factorises.) To prove this, look at the proof of Theorem
2.4.8.

Example 5.5.10. Suppose that $X$ and $Y$ have joint density $f(x,y) = e^{-x-y}$ for
$x, y > 0$, and $f(x,y) = 0$ elsewhere. Then $f(x,y)$ factorises as

\[ f(x,y) = e^{-x} e^{-y}, \]

and we conclude that $X$ and $Y$ are independent. □

Example 5.5.11. Consider Example 5.5.5. Note that $X$ and $Y$ are not independent,
although it appears that the joint density is a product of a function of $x$ alone and
a function of $y$ alone. Do you see why? □

Example 5.5.12. (Bivariate normal distribution) Suppose that $(X, Y)$ is a random
vector with density

\[
f(x,y) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\, \exp\Big( -\frac{x^2 - 2\rho xy + y^2}{2(1 - \rho^2)} \Big),
\]

where $\rho$ is a parameter taking values strictly between $-1$ and $1$. This expression
looks rather obscure and unattractive, but the point of this density will become
clear now.

First, we need to convince ourselves that $f$ is a density function, that is,
we need to verify that $\iint f(x,y)\,dx\,dy = 1$. We do this with a trick: we use our
knowledge of one-dimensional normal distributions. A little algebra shows that

\[
f(x,y) = \frac{1}{\sqrt{2\pi(1 - \rho^2)}}\, \exp\Big( -\frac{(x - \rho y)^2}{2(1 - \rho^2)} \Big)
  \times \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}.
\]

We now integrate the right-hand side first over $x$ and then over $y$. Integration
over $x$ leaves the second term unchanged, and in the first term we recognise the
density of a normal random variable with parameters $\rho y$ and $1 - \rho^2$. Hence when
we integrate over $x$, the first term gives just 1, and we are left with the second
term. However, the second term is the density of a standard normally distributed
random variable, and therefore also integrates to 1 (integration is over $y$ this time).
Therefore, we find that $\iint f(x,y)\,dx\,dy = 1$ as required.

Next, we want to compute the marginal distribution of $X$ and $Y$. Well, this
we have, in fact, already done in the previous calculation. Indeed, we showed
that when we integrate out $x$, the result is the density of a standard normal
distribution, and therefore, $Y$ has a standard normal distribution. Since $f(x,y)$ is
symmetric in $x$ and $y$, it follows immediately that $X$ also has a standard normal
distribution. Note that this is true for all choices of the parameter $\rho$, and this is a very
natural example in which the same marginal distributions have a different joint
distribution. As in the discrete setting, the joint distribution does determine the
marginals, but not the other way around. □



5.6 Functions of Random Variables and Vectors

Given a random variable $X$ or a random vector $(X, Y)$, and an appropriate function
$g$, when is $g(X)$ or $g(X,Y)$ a (continuous or discrete) random variable or vector?
For instance, is $X + Y$ a random variable, when $X$ and $Y$ are?

First let us have a look at the one-dimensional case. Let $X$ have density $f$.
If we want to compute the distribution function of $g(X)$ we would like to write

\[ P(g(X) \le y) = P(X \in g^{-1}(-\infty, y]). \qquad (5.8) \]

For this to make sense in our present context, $\{X \in g^{-1}(-\infty, y]\}$ must be a
set which has received a probability. More generally, we want sets of the form
$\{X \in g^{-1}(a,b)\}$ to have a well-defined probability for all $a \le b$. This is typically
the case, and we illustrate this with some examples.

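Example 5.6.1. Let $X$ be a continuous random variable with differentiable
distribution function $F_X$ and density $f_X = \frac{d}{dx} F_X(x)$. Let $g(x) = 2x + 3$. Then

\[
\begin{aligned}
P(a \le g(X) \le b) &= P(a \le 2X + 3 \le b) \\
  &= P\Big( \frac{a-3}{2} \le X \le \frac{b-3}{2} \Big) \\
  &= \int_{(a-3)/2}^{(b-3)/2} f_X(x)\,dx \\
  &= \int_a^b f_X\Big( \frac{y-3}{2} \Big) \frac{1}{2}\,dy,
\end{aligned}
\]

which means that $g(X)$ is a continuous random variable with
$f_{g(X)}(y) = \frac{1}{2} f_X((y-3)/2)$. □

Example 5.6.2. Let $X$ have a standard normal distribution, and let $g(x) = x^2$.
Then, writing $\Phi(x)$ for the distribution function of $X$, we have for $0 \le x \le y$,

\[
\begin{aligned}
P(x \le g(X) \le y) &= P(\sqrt{x} \le X \le \sqrt{y}) + P(-\sqrt{y} \le X \le -\sqrt{x}) \\
  &= \Phi(\sqrt{y}) - \Phi(\sqrt{x}) + \Phi(-\sqrt{x}) - \Phi(-\sqrt{y}) \\
  &= 2\Phi(\sqrt{y}) - 2\Phi(\sqrt{x}),
\end{aligned}
\]

since $\Phi(x) = 1 - \Phi(-x)$. Since $P(g(X) \le y) = P(0 \le g(X) \le y)$, differentiating
leads to

\[ \frac{d}{dy} P(g(X) \le y) = \frac{1}{\sqrt{y}}\, \Phi'(\sqrt{y}) = \frac{1}{\sqrt{2\pi y}}\, e^{-\frac{1}{2}y}. \]

Hence $g(X)$ is a continuous random variable, with density

\[ f_{g(X)}(y) = \frac{1}{\sqrt{2\pi y}}\, e^{-\frac{1}{2}y} \]

for all $y > 0$, and $f_{g(X)}(y) = 0$ for $y \le 0$. □

In most cases of interest, it is possible to do a similar computation. It is also
possible to write down a few general results, which reduce the amount of work in
many cases.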



Theorem 5.6.3. Let $X$ have density $f$, and let $g : \mathbb{R} \to \mathbb{R}$ be one-to-one and
differentiable, with differentiable inverse. Then $g(X)$ is a continuous random variable
with density

\[ f_{g(X)}(y) = f(g^{-1}(y)) \Big| \frac{d}{dy} g^{-1}(y) \Big| \]

for all $y$ in the range of $g$, and $f_{g(X)}(y) = 0$ elsewhere.

Proof. Without loss of generality, we assume that $g$ is non-decreasing. The result
is a consequence of the classical change of variable theorem from calculus. Using
the change of variables $z = g(x)$, we can write for $a \le b$ such that $(a, b)$ is in the
range of $g$:

\[
\begin{aligned}
P(a \le g(X) \le b) &= P(g^{-1}(a) \le X \le g^{-1}(b)) \\
  &= \int_{g^{-1}(a)}^{g^{-1}(b)} f(x)\,dx \\
  &= \int_a^b f(g^{-1}(z))\, \frac{d}{dz} g^{-1}(z)\,dz,
\end{aligned}
\]

proving the result. □



Example 5.6.4. Let $X$ and $g$ be as in Example 5.6.1. We can rederive the result
from Example 5.6.1, using Theorem 5.6.3 as follows. The inverse $g^{-1}(y)$ is given
by

\[ g^{-1}(y) = \frac{y-3}{2}, \]

and hence $\frac{d}{dy} g^{-1}(y) = \frac{1}{2}$. Substituting this in Theorem 5.6.3 gives the result
obtained in Example 5.6.1. □



For the higher-dimensional case, we state the next theorem in two dimensions.

Theorem 5.6.5. Let $(X_1, X_2)$ have joint density $f$, and let $g : \mathbb{R}^2 \to \mathbb{R}^2$ be one-to-one,
and write $g(x_1, x_2) = (g_1(x_1, x_2), g_2(x_1, x_2)) = (y_1, y_2)$. Since $g$ is one-to-one,
it can be inverted as $x_1 = x_1(y_1, y_2)$ and $x_2 = x_2(y_1, y_2)$. Let $J$ be the Jacobian of
this inverse transformation (where we assume enough differentiability). That is,

\[
J(y_1, y_2) = \frac{\partial x_1}{\partial y_1}\frac{\partial x_2}{\partial y_2}
            - \frac{\partial x_1}{\partial y_2}\frac{\partial x_2}{\partial y_1}.
\]

Then $(Y_1, Y_2) = (g_1(X_1, X_2), g_2(X_1, X_2))$ is a continuous random vector with joint density

\[ f_{(Y_1,Y_2)}(y_1, y_2) = f(x_1(y_1, y_2), x_2(y_1, y_2))\, |J(y_1, y_2)|, \]

if $(y_1, y_2)$ is in the range of $g$, and $f_{(Y_1,Y_2)}(y_1, y_2) = 0$ otherwise.

Proof. The proof is similar to the one-dimensional case and follows from the
classical change of variables formula from calculus. □

Example 5.6.6. Suppose that $X_1$ and $X_2$ are independent, exponentially
distributed random variables with the same parameter $\lambda$. We compute the joint
density of $(Y_1, Y_2)$ where

\[ Y_1 = X_1 + X_2 \quad \text{and} \quad Y_2 = X_1/X_2. \]

To find this joint density, let $g$ be the map defined by $g(x_1, x_2) = (x_1 + x_2, x_1/x_2) =
(y_1, y_2)$. The inverse map is then given by

\[ x_1 = \frac{y_1 y_2}{1 + y_2} \quad \text{and} \quad x_2 = \frac{y_1}{1 + y_2}, \]

and a simple computation shows that the Jacobian is equal to

\[ J(y_1, y_2) = -y_1/(1 + y_2)^2. \]

Substituting this in the statement of the theorem gives

\[
\begin{aligned}
f_{(Y_1,Y_2)}(y_1, y_2)
  &= f_{(X_1,X_2)}\Big( \frac{y_1 y_2}{1 + y_2}, \frac{y_1}{1 + y_2} \Big) \frac{|y_1|}{(1 + y_2)^2} \\
  &= \lambda^2 e^{-\lambda y_1}\, \frac{|y_1|}{(1 + y_2)^2},
\end{aligned}
\]

for $y_1, y_2 > 0$. □
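The "simple computation" of the Jacobian in Example 5.6.6 can be spelled out as follows. These steps are added as an aid and are not part of the original text; they use only the inverse map given in the example and the definition of $J$ in Theorem 5.6.5.

```latex
\[
\frac{\partial x_1}{\partial y_1} = \frac{y_2}{1 + y_2}, \qquad
\frac{\partial x_1}{\partial y_2} = \frac{y_1}{(1 + y_2)^2}, \qquad
\frac{\partial x_2}{\partial y_1} = \frac{1}{1 + y_2}, \qquad
\frac{\partial x_2}{\partial y_2} = -\frac{y_1}{(1 + y_2)^2},
\]
\[
J(y_1, y_2)
  = \frac{y_2}{1 + y_2} \cdot \Big( -\frac{y_1}{(1 + y_2)^2} \Big)
    - \frac{y_1}{(1 + y_2)^2} \cdot \frac{1}{1 + y_2}
  = -\frac{y_1 (y_2 + 1)}{(1 + y_2)^3}
  = -\frac{y_1}{(1 + y_2)^2}.
\]
```

The exponent in the joint density simplifies because $x_1 + x_2 = y_1$, so that $f_{(X_1,X_2)}(x_1, x_2) = \lambda^2 e^{-\lambda(x_1 + x_2)} = \lambda^2 e^{-\lambda y_1}$.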

Exercise 5.6.7. Show that $Y_1$ and $Y_2$ in this example are independent, and find
their marginal densities.

Exercise 5.6.8. If $X$ and $Y$ have joint density $f$, show that the density of $XY$
is given by

\[ f_{XY}(u) = \int_{-\infty}^{\infty} f(x, u/x)\, |x|^{-1}\,dx. \]

To do this, it is wise to first compute the joint density of $X$ and $XY$.

Exercise 5.6.9. Let $X_1 = aY_1 + bY_2$ and $X_2 = cY_1 + dY_2$, where $(X_1, X_2)$ is
a continuous random vector with joint density $f$. Show that $(Y_1, Y_2)$ is also a
continuous random vector with density

\[ g(y_1, y_2) = |ad - bc|\, f(ay_1 + by_2, cy_1 + dy_2), \]

if $ad - bc \ne 0$.

There is one more piece of theory associated with functions of random
variables. In the discrete context, when $X$ and $Y$ are independent random variables,
so are $g(X)$ and $h(Y)$, for any functions $g$ and $h$; see Theorem 2.2.5. When the
random variables are continuous, this should also be the case. In order to prove
this, we need a very weak condition which we will call regularity. Readers who
are not interested in these details can safely assume that all functions in this
book and of practical use are regular, and apply the forthcoming Theorem 5.6.12
without problems.

Definition 5.6.10. A function $g$ is said to be regular if there exist numbers
$\cdots < a_{-1} < a_0 < a_1 < \cdots$, with $a_i \to \infty$ and $a_{-i} \to -\infty$ when $i \to \infty$, so that $g$ is
continuous and monotone on each interval $(a_i, a_{i+1})$.

Example 5.6.11. The function given by $x \mapsto \sin x$ is regular; all polynomial
functions are regular. An example of a function which is not regular is $x \mapsto 1_{\mathbb{Q}}(x)$. □

Theorem 5.6.12. Let $X_1, \ldots, X_n$ be independent continuous random variables, and
let $g_1, g_2, \ldots, g_n$ be regular functions. Then $g_1(X_1), g_2(X_2), \ldots, g_n(X_n)$ are
independent random variables.

Proof. Assume for simplicity that $n = 2$. It follows from regularity that for all
$x \in \mathbb{R}$, we can write

\[ A_1 := \{y : g_1(y) \le x\} = \bigcup_i A_{1,i}(x) \]

and

\[ A_2 := \{y : g_2(y) \le x\} = \bigcup_i A_{2,i}(x), \]

as unions of pairwise disjoint intervals. Therefore, we can write

\[
\begin{aligned}
P(g_1(X_1) \le x, g_2(X_2) \le y)
  &= \sum_i \sum_j P(X_1 \in A_{1,i}(x), X_2 \in A_{2,j}(y)) \\
  &= \sum_i \sum_j P(X_1 \in A_{1,i}(x))\, P(X_2 \in A_{2,j}(y)) \\
  &= \sum_i P(X_1 \in A_{1,i}(x)) \sum_j P(X_2 \in A_{2,j}(y)) \\
  &= P(g_1(X_1) \le x)\, P(g_2(X_2) \le y),
\end{aligned}
\]

proving the theorem. □

Example 5.6.13. Suppose that $X$ and $Y$ are independent random variables. Then
also $\sin X$ and $\cos Y$ are independent, since $x \mapsto \sin x$ and $x \mapsto \cos x$ are regular
functions. □

5.7 Sums of Random Variables

In Exercise 5.6.8, it was shown already that when $X$ and $Y$ have a joint density,
the product $XY$ is also a continuous random variable with an explicit density. Is
$X + Y$ also a continuous random variable? The answer is yes, and there are various
ways to see this.

The first proof uses Theorem 5.6.5. By taking $(X, Y)$ and the function $g(x,y) =
(x + y, y)$ (which satisfies all requirements) we see that $(X + Y, Y)$ is a continuous
random vector. It then follows from Theorem 5.5.3 that $X + Y$ is also a continuous
random variable. This trick can be used for many other functions of $X$ and $Y$,
and shows that all these functions lead to continuous random variables.

In the case of the sum $X + Y$, there is, however, a more direct way to arrive
at the same conclusion. It is quite simple to compute the density of $X + Y$ directly.
This runs as follows. Writing $Z = X + Y$, and $f(x, y)$ for the joint density of $X$
and $Y$, we have

\[
\begin{aligned}
P(Z \le z) &= \iint_{\{(x,y)\,:\,x + y \le z\}} f(x,y)\,dx\,dy \\
  &= \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{z-x} f(x,y)\,dy\,dx \\
  &= \int_{u=-\infty}^{\infty} \int_{v=-\infty}^{z} f(u, v-u)\,dv\,du,
\end{aligned}
\]

by the substitution $u = x$, $v = y + x$. Now interchange the order of the integrals
and it follows that

\[ f_Z(z) = \int_{-\infty}^{\infty} f(x, z-x)\,dx. \qquad (5.9) \]

When $X$ and $Y$ are independent, this formula reduces to

\[ f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z-x)\,dx. \qquad (5.10) \]


Example 5.7.1. (The gamma distribution) Suppose that $X_1, \ldots, X_n$ are
independent and exponentially distributed with parameter $\lambda$. In this example we show
that the density of the sum

\[ S = X_1 + \cdots + X_n \]

is given by

\[ f_S(x) = \frac{\lambda^n x^{n-1} e^{-\lambda x}}{(n-1)!}, \quad x \ge 0. \qquad (5.11) \]

A random variable with this distribution is said to have a gamma distribution
with parameters $n$ and $\lambda$.

We proceed by induction. For $n = 1$, the density in (5.11) reduces to the
exponential density, and hence the claim is true.

Suppose now that the claim is true for $n - 1$, that is, the sum

\[ T = X_1 + \cdots + X_{n-1} \]

has a gamma distribution with parameters $n - 1$ and $\lambda$. Since $T$ and $X_n$ are
independent, we can use (5.10) to find

\[
\begin{aligned}
f_S(z) &= \int_0^z \frac{\lambda^{n-1}}{(n-2)!}\, x^{n-2} e^{-\lambda x} \cdot \lambda e^{-\lambda(z-x)}\,dx \\
       &= \frac{\lambda^n}{(n-2)!}\, e^{-\lambda z} \int_0^z x^{n-2}\,dx \\
       &= \frac{\lambda^n}{(n-1)!}\, z^{n-1} e^{-\lambda z},
\end{aligned}
\]

proving the induction step. We shall come across this distribution in Chapter 7.

□
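Formula (5.10) and the gamma density (5.11) can be compared numerically. The sketch below is an illustration under stated assumptions, not the book's method: it approximates the convolution integral with a midpoint rule and checks the cases $n = 2$ and $n = 3$ against (5.11). The function names, the parameter values and the number of grid steps are hypothetical choices.

```python
import math


def exp_density(x, lam):
    """Density of the exponential distribution with parameter lam."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def gamma_density(x, n, lam):
    """Density (5.11): lam^n x^(n-1) e^(-lam x) / (n-1)! for x >= 0."""
    if x < 0:
        return 0.0
    return lam ** n * x ** (n - 1) * math.exp(-lam * x) / math.factorial(n - 1)


def convolve_at(z, f, g, steps=2000):
    """Midpoint-rule approximation of the integral of f(x) g(z - x) over [0, z].

    Restricting to [0, z] is valid here because both densities vanish on the
    negative half-line.
    """
    if z <= 0:
        return 0.0
    h = z / steps
    return h * sum(f((i + 0.5) * h) * g(z - (i + 0.5) * h) for i in range(steps))


if __name__ == "__main__":
    lam, z = 1.5, 2.0  # illustrative values
    two = convolve_at(z, lambda x: exp_density(x, lam), lambda x: exp_density(x, lam))
    three = convolve_at(z, lambda x: gamma_density(x, 2, lam), lambda x: exp_density(x, lam))
    print(two, gamma_density(z, 2, lam))
    print(three, gamma_density(z, 3, lam))
```

In these two cases the integrand is constant or linear in $x$, mirroring the factor $\int_0^z x^{n-2}\,dx$ in the induction step, so the midpoint rule reproduces (5.11) up to rounding.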



5.8 More About the Expectation; Variance 

At this point it is convenient to say a few more things about expectations. Recall
that we have defined the expectation of a continuous random variable $X$ as

\[ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \]

whenever this integral is well defined. We have also seen already that

\[ E(aX + b) = aE(X) + b, \]

which we proved by first computing the density of the random variable $aX + b$.

We will now show that in general, $E(X + Y) = E(X) + E(Y)$, for continuous
random variables $X$ and $Y$.

Theorem 5.8.1. Let $X$ and $Y$ be continuous random variables with finite
expectations and joint density $f$. Then

\[ E(X + Y) = E(X) + E(Y). \]

Proof. There are various ways to prove this. The route that we follow proceeds
via the density of the sum $X + Y$, which we obtained in Section 5.7. We write

\[
\begin{aligned}
E(X + Y) &= \int_{-\infty}^{\infty} z f_{X+Y}(z)\,dz \\
  &= \int_{z=-\infty}^{\infty} z \int_{x=-\infty}^{\infty} f(x, z-x)\,dx\,dz \\
  &= \int_{y=-\infty}^{\infty} \int_{x=-\infty}^{\infty} (x + y) f(x,y)\,dx\,dy \\
  &= \int_{\mathbb{R}} \int_{\mathbb{R}} x f(x,y)\,dx\,dy + \int_{\mathbb{R}} \int_{\mathbb{R}} y f(x,y)\,dx\,dy \\
  &= \int_{\mathbb{R}} x \int_{\mathbb{R}} f(x,y)\,dy\,dx + \int_{\mathbb{R}} y \int_{\mathbb{R}} f(x,y)\,dx\,dy \\
  &= \int_{\mathbb{R}} x f_X(x)\,dx + \int_{\mathbb{R}} y f_Y(y)\,dy = E(X) + E(Y).
\end{aligned}
\]

□

Exercise 5.8.2. Extend this result to all appropriate cases where the expectation
of $X$ and/or $Y$ is infinite.

The following Theorem 5.8.4 tells us that in order to compute the expectation
of $g(X)$, for a suitable function $g : \mathbb{R} \to \mathbb{R}$, we do not need to compute the density
(or mass function) of $g(X)$. In the proof of this theorem, we will need the following
lemma, the continuous analogue of Exercise 2.3.7.

Lemma 5.8.3. Suppose that $X$ is a continuous random variable with density $f$
satisfying $f(x) = 0$ for all $x < 0$. Then

\[ E(X) = \int_0^{\infty} (1 - F_X(x))\,dx, \]

where $F_X(x) = P(X \le x)$ denotes the distribution function of $X$, as usual.

Proof.

\[
\begin{aligned}
\int_0^{\infty} (1 - F_X(x))\,dx &= \int_0^{\infty} \int_x^{\infty} f(y)\,dy\,dx \\
  &= \int_0^{\infty} f(y) \int_0^{y} dx\,dy \\
  &= \int_0^{\infty} y f(y)\,dy = E(X). \qquad □
\end{aligned}
\]

Theorem 5.8.4. Let $X$ be a continuous random variable with density $f$. Let $g$ be
such that $g(X)$ is a continuous or discrete random variable. Then

\[ E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx, \]

whenever this integral is well defined.

Proof. We will only give the proof in the case where $g(X)$ is a continuous random
variable. Suppose first that $g(x) \ge 0$ for all $x$. Then the density $f_{g(X)}(y)$ is equal
to 0 for $y < 0$ and we can apply Lemma 5.8.3 to $g(X)$. This leads to

\[
\begin{aligned}
E(g(X)) &= \int_0^{\infty} P(g(X) > x)\,dx = \int_0^{\infty} \int_{\{y\,:\,g(y) > x\}} f(y)\,dy\,dx \\
  &= \int_{-\infty}^{\infty} \int_0^{g(y)} dx\, f(y)\,dy \\
  &= \int_{-\infty}^{\infty} g(y) f(y)\,dy,
\end{aligned}
\]

which is what we wanted to prove, still under the assumption that $g \ge 0$ though.
For general $g$ we use a trick that is very useful in general, when we want to extend
a result from positive to general functions. We define

\[ g^+(x) = \max(g(x), 0) \quad \text{and} \quad g^-(x) = -\min(g(x), 0). \]

We say that $g^+$ is the positive part of $g$ and $g^-$ the negative part of $g$. Note that
both $g^+$ and $g^-$ are non-negative functions, and that

\[ g(x) = g^+(x) - g^-(x). \]

Now from Theorem 5.8.1 and the first part of this proof, we have

\[
\begin{aligned}
E(g(X)) &= E(g^+(X) - g^-(X)) \\
  &= E(g^+(X)) - E(g^-(X)) \\
  &= \int_{\mathbb{R}} g^+(x) f(x)\,dx - \int_{\mathbb{R}} g^-(x) f(x)\,dx \\
  &= \int_{\mathbb{R}} g(x) f(x)\,dx.
\end{aligned}
\]

□

Example 5.8.5. Suppose that $X$ has an exponential distribution with parameter
1, and $Y = X^2$. We can compute the expectation of $Y$ as follows:

\[ E(Y) = \int_0^{\infty} x^2 e^{-x}\,dx = 2. \]

□

Exercise 5.8.6. Suppose that $(X, Y)$ is a continuous random vector with joint
density $f(x,y)$, and that $g : \mathbb{R}^2 \to \mathbb{R}$ is such that $g(X,Y)$ is a random variable.
Prove that

\[ E(g(X,Y)) = \int_{\mathbb{R}} \int_{\mathbb{R}} g(x,y) f(x,y)\,dx\,dy. \]

Exercise 5.8.7. Show, using the previous exercise, that when $X$ and $Y$ are
independent and have a joint density, then

\[ E(XY) = E(X)E(Y). \]

Definition 5.8.8. The variance $\mathrm{var}(X)$ of a random variable $X$ with finite
expectation is defined as

\[ \mathrm{var}(X) = E\big( (X - E(X))^2 \big). \]

According to Theorem 5.8.4, we can compute the variance of a random
variable $X$ with expectation $\mu$ and density $f$ via

\[ \mathrm{var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx. \]

Exercise 5.8.9. Show that $\mathrm{var}(X) = E(X^2) - (E(X))^2$. This formula is very
useful for computations.

Exercise 5.8.10. Show that $\mathrm{var}(aX + b) = a^2\,\mathrm{var}(X)$, using the previous exercise.

Example 5.8.11. To compute the variance of a random variable $X$ with a
uniform $(0, 1)$ distribution, we proceed as follows. First we compute
$\int_0^1 x^2 f(x)\,dx = \int_0^1 x^2\,dx = \frac{1}{3}$. Hence $E(X^2) = \frac{1}{3}$ and $\mathrm{var}(X) = 1/3 - (1/2)^2 = 1/12$. □

Example 5.8.12. To compute the variance of a standard normally distributed
random variable $X$, we compute $\int_{-\infty}^{\infty} x^2 f_X(x)\,dx = 1$. This can be done with partial
integration. Hence $E(X^2) = 1$, and since $E(X) = 0$, it follows that the variance
of $X$ is 1.

If $X$ has a normal distribution with parameters $\mu$ and $\sigma^2$, we use the fact
that

\[ Y = \frac{X - \mu}{\sigma} \]

has a standard normal distribution. Since $X = \sigma Y + \mu$, this gives, using Exercise
5.8.10, that $\mathrm{var}(X) = \sigma^2\,\mathrm{var}(Y) = \sigma^2$. □

Exercise 5.8.13. Compute the variance of an exponentially distributed random
variable.

Exercise 5.8.14. Show that when $X$ and $Y$ are independent and have a joint
density, then

\[ \mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y). \]

Example 5.8.15. (Bivariate normal distribution) Suppose that $(X, Y)$ has a
bivariate normal distribution with density

\[
f(x,y) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\, \exp\Big( -\frac{x^2 - 2\rho xy + y^2}{2(1 - \rho^2)} \Big),
\]

where $\rho$ is a parameter taking values strictly between $-1$ and $1$. At this point we
can investigate the significance of the parameter $\rho$. There is a nice interpretation
of $\rho$, and this can be seen when we compute $E(XY)$. Using the same trick as in
the proof that $f$ is a density, we find, using Exercise 5.8.6, that

\[
\begin{aligned}
E(XY) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy f(x, y)\,dx\,dy \\
  &= \int_{\mathbb{R}} y\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}
     \int_{\mathbb{R}} x\, \frac{1}{\sqrt{2\pi(1 - \rho^2)}}\, \exp\Big( -\frac{(x - \rho y)^2}{2(1 - \rho^2)} \Big)\,dx\,dy.
\end{aligned}
\]

The inner integral over $x$ is the expectation of a normal distribution with
parameters $\rho y$ and $1 - \rho^2$ and therefore equal to $\rho y$. Hence we find that

\[ E(XY) = \rho \int_{\mathbb{R}} y^2\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}\,dy = \rho. \]

We see that the covariance of $X$ and $Y$, defined as

\[ \mathrm{cov}(X,Y) = E(XY) - E(X)E(Y), \]

is equal to $\rho$. In some sense, $\rho$ quantifies the dependence between $X$ and $Y$. □
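The interpretation of the parameter as a covariance can be checked by simulation. The sketch below is illustrative and rests on the factorisation used in the example: $Y$ is standard normal and, given $Y = y$, $X$ is normal with parameters $\rho y$ and $1 - \rho^2$. The function names, the sample size and the seed are assumptions made for the illustration.

```python
import math
import random


def sample_pair(rho, rng):
    """Draw (X, Y): Y standard normal, then X normal with mean rho*Y and variance 1 - rho^2."""
    y = rng.gauss(0.0, 1.0)
    x = rng.gauss(rho * y, math.sqrt(1.0 - rho * rho))
    return x, y


def estimate_covariance(rho, n=200_000, seed=2):
    """Sample estimate of E(XY) - E(X)E(Y) from n simulated pairs."""
    rng = random.Random(seed)
    pairs = [sample_pair(rho, rng) for _ in range(n)]
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum(x * y for x, y in pairs) / n - mean_x * mean_y


if __name__ == "__main__":
    for rho in (-0.3, 0.0, 0.6):
        print(rho, round(estimate_covariance(rho), 3))
```

For each choice of the parameter, the estimated covariance should be close to the parameter itself.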

4 Exercise 5.8.16. Show that in the above example, X and Y are independent 
if and only if p = 0. This means that X and Y are independent if and only if 
E{XY) = E{X)E{Y). This is quite remarkable: in general independence implies 
E{XY) = E{X)E{Y) but not the other way around. In this case the two notions 
are equivalent. 



5.9 Random Variables Which Eire Neither Discrete 
Nor Continuous 

So far we have distinguished between discrete and continuous random variables. 
The following example shows that there are very natural random variables which 
are neither discrete nor continuous. 

Example 5.9.1. Suppose that we enter a post office, and that we want to describe 
our waiting time until we are served. Call this waiting time X. What would be a 
reasonable distribution for X? There are two possibilities: either there is no one 
in the post office and our waiting time is equal to 0, or there are a certain number 
of customers before us, and our waiting time is positive. If we have to wait, it 
is natural to assume that the waiting time is a continuous random variable, for 
instance with an exponential distribution with parameter A > 0. 

Now observe that the waiting time X is not discrete, since it can take any 
value in M. On the other hand, it is not continuous either since there is a positive 
probability that we do not have to wait at all, that is, P{X = 0) > 0. We can view 
A as a mixture of a discrete and a continuous random variable. □ 
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This example leads to the following formal definition: 

Definition 5.9.2. Let T be a discrete random variable, and Z be a continuous ran- 
dom variable. Furthermore, let U he a random variable with distribution 
P(^U = z=z p and P{U = 0) = 1 - p, for some p G (0,1). Suppose that all 
random variables are independent. Then the random variable X defined by 

X = UY -\-{l-U)Z (5.12) 

is called a mixture of Y and Z. 

We should think of X as being equal to Y with probability p, and equal to
Z with probability $1 - p$. Hence, another way of defining X is as follows:

$$X = \begin{cases} Y & \text{with probability } p, \\ Z & \text{with probability } 1 - p. \end{cases}$$

We can define expectations of mixtures in the obvious way as follows: 

Definition 5.9.3. The expectation of the mixture X in (5.12) is defined as

$$E(X) = pE(Y) + (1 - p)E(Z).$$
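
To see Definitions 5.9.2 and 5.9.3 at work, the following minimal Python sketch
simulates a waiting time of the kind described in Example 5.9.1 as a mixture, and
compares the empirical mean with $pE(Y) + (1 - p)E(Z)$. The function names and the
values p = 0.25 and $\lambda$ = 2 are illustrative assumptions, not taken from the text.
Here Y is the constant 0, which is a (degenerate) discrete random variable.

```python
import random


def sample_mixture(p, sample_y, sample_z, rng):
    """Draw one value of X = U*Y + (1 - U)*Z, where P(U = 1) = p."""
    u = 1 if rng.random() < p else 0
    y = sample_y(rng)  # U, Y and Z are drawn independently
    z = sample_z(rng)
    return u * y + (1 - u) * z


def mixture_mean(p, mean_y, mean_z):
    """E(X) = p E(Y) + (1 - p) E(Z), as in Definition 5.9.3."""
    return p * mean_y + (1 - p) * mean_z


def simulate_waiting_time(p_empty=0.25, lam=2.0, n=200_000, seed=1):
    """Return the empirical mean of X and the fraction of draws equal to 0.

    p_empty and lam are hypothetical values chosen only for illustration.
    """
    rng = random.Random(seed)
    draws = [
        sample_mixture(p_empty, lambda r: 0.0, lambda r: r.expovariate(lam), rng)
        for _ in range(n)
    ]
    empirical_mean = sum(draws) / n
    fraction_zero = sum(1 for d in draws if d == 0.0) / n
    return empirical_mean, fraction_zero
```

With these assumed values, `mixture_mean(0.25, 0.0, 0.5)` equals 0.375, and the
simulated mean should be close to it. The fraction of draws that are exactly 0 should
be close to 0.25, which illustrates that $P(X = 0) > 0$ for such a mixture.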

♠ Exercise 5.9.4. Define the variance of the mixture X in (5.12). Motivate your
suggestion.

Example 5.9.5. Suppose that the waiting time in a post office is 0 with probability
and exponentially distributed (with parameter $\lambda$) with probability |. The
expected waiting time is now equal = The probability that the
waiting time is at most 1 is now equal to ^ × 1 + | × $P(Z \le 1)$, where Z has an
exponential distribution with parameter $\lambda$. □

♠ Exercise 5.9.6. Finish the computation in this example, and also compute the
probability that the waiting time is at least 2.

Mixtures are natural examples of random variables which are neither discrete 
nor continuous. There is, however, also a class of random variables and vectors 
which are neither discrete nor continuous, but which are of a different nature, and 
which cannot be expressed as a combination of discrete and continuous random 
variables. Rather than attempt a formal definition, we give an example, hoping
that this makes the idea clear. The simplest example is in dimension 2.

Example 5.9.7. Suppose that X has a uniform distribution on $(-1, 1)$, and consider
the vector $(X, |X|)$. Clearly, $(X, |X|)$ is not a discrete random vector, and it is not
a mixture of a continuous and discrete random vector either. Nevertheless, we
claim that the vector $(X, |X|)$ has no joint density. To see this, we proceed by
contradiction.

Suppose, therefore, that it has joint density f. Observe that

$$P\big((X, |X|) \in \{(x,y) : y \neq |x|\}\big) = 0,$$

so that

$$\iint_{\{(x,y)\,:\,y \neq |x|\}} f(x,y)\,dx\,dy = 0,$$

and therefore

$$\iint_{\{(x,y)\,:\,y = |x|\}} f(x,y)\,dx\,dy = 1,$$

which is clearly impossible. □

♠ Exercise 5.9.8. Show that for the same X as in Example 5.9.7, the vector
$(X, X^2)$ has no joint density.



5.10 Conditional Distributions and Expectations 



Suppose that X and Y have a joint density. As in the discrete setting, we would
like to talk about the conditional distribution of Y given that X takes the value
x, say. However, since $P(X = x) = 0$, we cannot define

$$P(Y \le y \mid X = x)$$

as in the discrete setting. This means that the theory as developed in Chapter
2 is not useful here. But clearly, we would like to have a notion of conditional
distributions, even in the continuous case. How can we do that? There are, in fact,
several possible approaches to this problem.

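One can try to define $P(Y \le y \mid X = x)$ as the limit of
$P(Y \le y \mid x \le X \le x + \Delta)$, for $\Delta \downarrow 0$. Under certain weak regularity
conditions (which I do not specify here), this approach leads to the following
computation, assuming that X and Y have joint density $f(x,y)$:

$$\begin{aligned}
\lim_{\Delta \downarrow 0} P(Y \le y \mid x \le X \le x + \Delta)
&= \lim_{\Delta \downarrow 0} \frac{P(Y \le y,\ x \le X \le x + \Delta)}{P(x \le X \le x + \Delta)} \\
&= \lim_{\Delta \downarrow 0} \frac{\int_{-\infty}^{y} \int_{x}^{x+\Delta} f(u,v)\,du\,dv}{\int_{x}^{x+\Delta} f_X(u)\,du} \\
&= \lim_{\Delta \downarrow 0} \frac{\Delta \int_{-\infty}^{y} f(x,v)\,dv}{\Delta f_X(x)} \\
&= \int_{-\infty}^{y} \frac{f(x,v)}{f_X(x)}\,dv.
\end{aligned}$$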

Hence this approach leads to the conclusion that we should define

$$P(Y \le y \mid X = x) = \int_{-\infty}^{y} \frac{f(x,v)}{f_X(x)}\,dv,$$

and hence that the conditional distribution of Y given X = x (as a function of y)
has density $f(x, \cdot)/f_X(x)$.
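
The limit computation above can be checked numerically. The following sketch
assumes, purely for illustration, the joint density $f(x,y) = x + y$ on the unit square
(this density does not come from the text), and compares the window probability
$P(Y \le y \mid x \le X \le x + \Delta)$ for shrinking $\Delta$ with the integral of $f(x,v)/f_X(x)$.
All function names are hypothetical.

```python
def joint_density(x, y):
    """Illustrative joint density f(x, y) = x + y on the unit square (an assumption)."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0


def midpoint_integral(g, a, b, steps=400):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))


def marginal_x(x):
    """f_X(x): integrate the joint density over y in [0, 1]."""
    return midpoint_integral(lambda v: joint_density(x, v), 0.0, 1.0)


def conditional_cdf_formula(y, x):
    """Integral over v <= y of f(x, v) / f_X(x); the density vanishes for v < 0."""
    return midpoint_integral(lambda v: joint_density(x, v), 0.0, y) / marginal_x(x)


def conditional_cdf_window(y, x, delta):
    """P(Y <= y | x <= X <= x + delta), computed from the joint density."""
    numerator = midpoint_integral(
        lambda u: midpoint_integral(lambda v: joint_density(u, v), 0.0, y),
        x, x + delta, steps=50,
    )
    denominator = midpoint_integral(marginal_x, x, x + delta, steps=50)
    return numerator / denominator
```

For instance, with x = 0.3 and y = 0.5 the formula gives $(xy + y^2/2)/(x + 1/2) = 0.34375$,
and `conditional_cdf_window(0.5, 0.3, delta)` moves towards this value as `delta`
decreases.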

We now explain a different approach which is slightly more general and which 
is motivated by some aspects of the discrete theory which we discuss first. 

In the discrete setting, we have

$$P(Y \le y) = \sum_x P(Y \le y, X = x) = \sum_x P(Y \le y \mid X = x)P(X = x).$$

This suggests that in the continuous setting, we could perhaps define the
conditional distribution function of Y given X = x, denoted by $F_{Y|X}(y \mid x)$, implicitly
via

$$\int_{-\infty}^{\infty} F_{Y|X}(y \mid x) f_X(x)\,dx = P(Y \le y). \tag{5.13}$$

This will indeed be our approach. However, relation (5.13) alone is not enough to 
determine the conditional distribution uniquely. We must demand more relations 
to essentially guarantee uniqueness. It is instructive to do this in the discrete 
setting first. 

Let X and Y be discrete random variables. The following result was not 
stated in Chapter 2, but for our current purpose it is very useful. 

Theorem 5.10.1. For discrete random variables X and Y, the conditional
distribution functions $F_{Y|X}(\cdot \mid x)$ satisfy

$$\sum_{x \in A} F_{Y|X}(y \mid x)P(X = x) = P(Y \le y, X \in A), \tag{5.14}$$

for any y and A. Moreover, if, for fixed y, there is another function $\tau(x)$ which
also, for all A, satisfies

$$\sum_{x \in A} \tau(x)P(X = x) = P(Y \le y, X \in A), \tag{5.15}$$

then $P(F_{Y|X}(y \mid X) = \tau(X)) = 1$.

Proof. The proof of the first assertion is similar to the proof of Theorem 2.5.7 and
is not repeated here.

The second assertion is proved by contradiction. Suppose $\tau(x)$ exists with
$P(F_{Y|X}(y \mid X) \neq \tau(X)) > 0$. Without loss of generality, we may assume that
$P(F_{Y|X}(y \mid X) > \tau(X)) > 0$. This means that there is an x with $P(X = x) > 0$
and $F_{Y|X}(y \mid x) > \tau(x)$. Now, taking $A = \{x : F_{Y|X}(y \mid x) > \tau(x)\}$, we find

$$\sum_{x \in A} F_{Y|X}(y \mid x)P(X = x) > \sum_{x \in A} \tau(x)P(X = x),$$

which contradicts (5.14) and (5.15). □

This means that in the discrete setting, we could have defined $F_{Y|X}(y \mid x)$ as
any function $\tau(x)$ which satisfies

$$\sum_{x \in A} \tau(x)P(X = x) = P(Y \le y, X \in A), \tag{5.16}$$

for all A.

We now return to the continuous setting. Readers not particularly interested 
in details can skip the following theorem, and read Definition 5.10.3 immediately. 

Let X have a density $f_X$ and let Y be any (discrete or continuous) random
variable.

Theorem 5.10.2. Let X have density $f_X$. Let $\tau$ and $\tau'$ be two regular functions
which satisfy

$$\int_A \tau(x) f_X(x)\,dx = \int_A \tau'(x) f_X(x)\,dx, \tag{5.17}$$

in the sense that if one of the integrals exists, so does the other, and in that case
they have the same value. Then

$$P(\tau(X) = \tau'(X)) = 1.$$



Proof. We have

$$\int_A (\tau(x) - \tau'(x)) f_X(x)\,dx = 0.$$

Suppose, again without loss of generality, that $P(\tau(X) > \tau'(X)) > 0$. Consider the
set $A = \{x : \tau(x) > \tau'(x)\}$. Since $\tau$ and $\tau'$ are both regular, the integral $\int_A f_X(x)\,dx$
exists, and we find

$$\int_A (\tau(x) - \tau'(x)) f_X(x)\,dx = 0.$$

Since the integrand is non-negative and piecewise continuous, this means that the
integrand must be piecewise equal to 0. This implies that $f_X(x) = 0$ for all $x \in A$,
except for at most a countable number of isolated points. This in turn leads to

$$P(\tau(X) > \tau'(X)) = P(X \in A) = \int_A f_X(x)\,dx = 0,$$

which contradicts the assumption that $P(\tau(X) > \tau'(X)) > 0$. □



This now enables us to define conditional distributions. 



Definition 5.10.3. Let X have density $f_X$. A conditional distribution of Y given X
is a family of functions $F_{Y|X}(\cdot \mid x)$ with the properties that

(1) for each x, $F_{Y|X}(\cdot \mid x)$ is the distribution function of a random variable,

(2) for each y, $F_{Y|X}(y \mid \cdot)$ is regular, as a function of x,

(3) $F_{Y|X}(y \mid x)$ satisfies, for all $a < b$,

$$\int_a^b F_{Y|X}(y \mid x) f_X(x)\,dx = P(Y \le y, a < X < b). \tag{5.18}$$

Observe that for each fixed y, $F_{Y|X}(y \mid X)$ is a random number which is with
probability 1 the same for any two choices of the conditional distribution, according
to Theorem 5.10.2. Hence there is a certain uniqueness in this definition. Note that
this approach is quite general in the sense that we have not required (X, Y) to be
a continuous random vector. In Example 5.10.11 below we will give an interesting
application of this.

Having done a certain amount of work to formally define conditional distri- 
butions, we can now see in a number of examples how this works in practice. The 
general approach in practice is often to guess the form of the conditional distri- 
bution, and after that verify (5.18). Most practical cases will be covered by the 
following special case in which X and Y have a joint density. It will be understood 
from now on that all relevant densities are regular. 

Example 5.10.4. Suppose that (X, Y) have a joint density f, and let x be such that
$f_X(x) > 0$. In this case, the conditional distribution function of Y given X = x is

$$F_{Y|X}(y \mid x) = \int_{-\infty}^{y} \frac{f(x,v)}{f_X(x)}\,dv.$$

To check this, we simply plug this formula into (5.18). This gives

$$\int_a^b \int_{-\infty}^{y} \frac{f(x,v)}{f_X(x)}\,dv\, f_X(x)\,dx = \int_a^b \int_{-\infty}^{y} f(x,v)\,dv\,dx = P(Y \le y, a < X < b),$$

as required. Hence the conditional density of Y given X = x is given by

$$f_{Y|X}(y \mid x) = \frac{f(x,y)}{f_X(x)},$$

and the conditional expectation $E(Y \mid X = x)$ is defined as

$$\int_{-\infty}^{\infty} y f_{Y|X}(y \mid x)\,dy.$$

These formulas should not come as a surprise, as they are the continuous analogue
of the discrete formulas. □



♠ Exercise 5.10.5. Make sure that you understand the last remark by writing
down the corresponding discrete formulas.

Theorem 5.10.6. Let X and Y have joint density f. Then we can compute the
expectation of Y via

$$E(Y) = \int_{-\infty}^{\infty} E(Y \mid X = x) f_X(x)\,dx.$$

Proof.

$$\begin{aligned}
E(Y) &= \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} y f(x,y)\,dy\,dx \\
&= \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} y f_{Y|X}(y \mid x) f_X(x)\,dy\,dx \\
&= \int_{x=-\infty}^{\infty} E(Y \mid X = x) f_X(x)\,dx.
\end{aligned}$$

□



Example 5.10.7. Recall the joint density in Example 5.5.5:

$$f_{X,Y}(x,y) = \frac{1}{x}$$

for $0 < y < x < 1$, and $f_{X,Y}(x,y) = 0$ otherwise. A simple computation as in
Example 5.10.4 above now shows that the conditional density $f_{Y|X}(y \mid x)$ is equal
to $1/x$, if $0 < y < x < 1$. This is to say that conditional on X, Y is uniformly
distributed on $(0, x)$. We can now calculate the expectation of Y as

$$E(Y) = \int_0^1 E(Y \mid X = x) f_X(x)\,dx = \int_0^1 \tfrac{1}{2} x\,dx = \tfrac{1}{4}.$$

□
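
The value $E(Y) = 1/4$ can be checked by simulation. The sketch below is a hedged
illustration, not part of the original example: it uses that here $f_X(x) = 1$ on (0, 1),
so X is uniform on (0, 1), and that given X = x, Y is uniform on (0, x). The function
name and the sample size are assumptions.

```python
import random


def estimate_mean_y(n=200_000, seed=0):
    """Estimate E(Y) directly and via the average of E(Y | X) = X / 2."""
    rng = random.Random(seed)
    total_y = 0.0
    total_half_x = 0.0
    for _ in range(n):
        x = rng.random()          # X is uniform on (0, 1), since f_X(x) = 1
        y = rng.uniform(0.0, x)   # given X = x, Y is uniform on (0, x)
        total_y += y
        total_half_x += x / 2     # E(Y | X = x) = x / 2
    return total_y / n, total_half_x / n
```

Both returned averages should be close to 1/4; the second one averages the
conditional expectation, which is what Theorem 5.10.6 expresses.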



Example 5.10.8. Let (X, Y) have the standard bivariate normal distribution of
Example 5.5.12. A short computation shows that

$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{1}{\sqrt{2\pi(1 - \rho^2)}} \exp\left(-\frac{(y - \rho x)^2}{2(1 - \rho^2)}\right),$$

which we recognise as the density of a normal distribution with parameters $\rho x$
and $1 - \rho^2$. It follows that

$$E(Y \mid X = x) = \rho x.$$

□
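
As a rough numerical check of this example, the sketch below estimates the mean
and variance of Y among samples whose X-value lies in a thin window around x. It
assumes the standard construction $Y = \rho X + \sqrt{1 - \rho^2}\,Z$ with X and Z independent
standard normal, which produces a standard bivariate normal pair with correlation
$\rho$. The values $\rho = 0.6$, x = 1 and the window width are illustrative assumptions.

```python
import math
import random


def conditional_moments(rho=0.6, x=1.0, width=0.05, n=400_000, seed=2):
    """Mean and variance of Y among samples with |X - x| < width / 2."""
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        x_val = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        y_val = rho * x_val + math.sqrt(1 - rho ** 2) * z
        if abs(x_val - x) < width / 2:   # thin window standing in for {X = x}
            ys.append(y_val)
    mean = sum(ys) / len(ys)
    var = sum((y - mean) ** 2 for y in ys) / len(ys)
    return mean, var
```

With the assumed values, the two estimates should be close to $\rho x = 0.6$ and
$1 - \rho^2 = 0.64$ respectively, up to sampling error and the small bias caused by the
window.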

♠ Exercise 5.10.9. Let (X, Y) have joint density

f(x, y) =

for $0 < x < y$, and $f(x,y) = 0$ elsewhere. Find the conditional density and
conditional expectation of Y given X.

♠ Exercise 5.10.10. Suppose that (X, Y) has a joint density and that X and Y
are independent. Show that $f_{X|Y}(x \mid y) = f_X(x)$.

In the next example, there is no joint density, and yet we will be able to 
compute conditional probabilities. 

Example 5.10.11. Consider the random vector $(X, |X|)$ in Example 5.9.7. The
distribution function $F_X(x)$ of X is given by $F_X(x) = \tfrac{1}{2} + \tfrac{1}{2}x$, for $-1 < x < 1$.
Suppose now that $|X| = y$. What should be the conditional probability that
$X \le x$? Let $x > 0$. Clearly, when $0 < y \le x$, given that $|X| = y$, the probability
that $X \le x$ must be 1, and when $y > x$, this probability should be $\tfrac{1}{2}$ by symmetry.
We can now verify this by verifying (5.18):

♠ Exercise 5.10.12. Show first that $f_{|X|}(x) = 1$ for $0 < x < 1$.

We can now write

$$\int_0 F_{X \mid |X|}(x \mid y)\, f_{|X|}(y)\,dy = $$

as required.

♠ Exercise 5.10.13. Do a similar computation for the case x < 0.

□
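
Relation (5.18) for this example can also be checked numerically. The sketch below
is only an illustration under stated assumptions: it takes the guessed conditional
distribution function for $x > 0$ (equal to 1 when $y \le x$ and to 1/2 when $y > x$), uses
$f_{|X|}(y) = 1$ on (0, 1), and compares the integral on the left of (5.18) with a Monte
Carlo estimate of $P(X \le x, a < |X| < b)$. The function names and the test values of
x, a and b are hypothetical.

```python
import random


def cond_cdf(x, y):
    """Guessed value of P(X <= x | |X| = y) for x > 0 and 0 < y < 1."""
    return 1.0 if y <= x else 0.5


def left_side(x, a, b, steps=10_000):
    """Midpoint approximation of the integral of cond_cdf(x, y) * 1 over a < y < b."""
    h = (b - a) / steps
    return h * sum(cond_cdf(x, a + (k + 0.5) * h) for k in range(steps))


def right_side(x, a, b, n=200_000, seed=3):
    """Monte Carlo estimate of P(X <= x, a < |X| < b) for X uniform on (-1, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        v = rng.uniform(-1.0, 1.0)
        if v <= x and a < abs(v) < b:
            hits += 1
    return hits / n
```

For x = 0.4, a = 0.1 and b = 0.9 the left side is $0.3 + \tfrac{1}{2} \cdot 0.5 = 0.55$, and the
probability on the right is $P(-0.9 < X < -0.1) + P(0.1 < X \le 0.4) = 0.4 + 0.15 = 0.55$.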

At this point we should halt for a moment and see what we have achieved. We 
have developed a theory that enables us to compute conditional distributions of 
Y given X for many random variables X and Y. There are still situations though 
which cannot be captured satisfactorily with the current machinery. To illustrate 
this, look at the following worrying but important example. 

Example 5.10.14. Consider the exponential random variables $X_1$ and $X_2$ in
Example 5.6.6. Suppose that we ask for the conditional density of $X_1 + X_2$ given
that $X_1 = X_2$. As such, this is a new type of conditioning that does not fit in our
framework. But we can perhaps make it fit in our framework by looking at

$$Y_2 = X_1/X_2,$$

which was already used in Example 5.6.6. Indeed, conditioning on $X_1 = X_2$ should
be the same as conditioning on $Y_2 = 1$. In Example 5.6.6, we showed that
$Y_1 = X_1 + X_2$ and $Y_2$ are independent, and from this it follows that (using Exercise
5.10.10)

$$f_{Y_1|Y_2}(y_1 \mid y_2) = f_{Y_1}(y_1) = \lambda^2 y_1 e^{-\lambda y_1}.$$

This seems to answer our original question. But we could also have translated the
original condition $X_1 = X_2$ into the condition that

$$Y_3 = X_1 - X_2 = 0.$$

From this point of view, we are then asked to compute $f_{Y_1|Y_3}(y_1 \mid y_3)$.

♠ Exercise 5.10.15. Show that

for $|y_3| < y_1$.

The required conditional density should therefore be equal to

$$f_{Y_1|Y_3}(y_1 \mid 0) = \lambda e^{-\lambda y_1},$$

which is different from our first answer! This seems to be a serious problem. Which
of the two answers is correct and why?

The answers to these questions are not so easy to understand. The problem
is that there is no unique correct answer. Both answers are in some sense correct,
but they express something different. The key point is that in our current theory
of conditioning, we do not condition on a single event anymore. The conditional
distribution function $F_{Y|X}(y \mid x)$ is defined as a family of functions, and the
definition simply does not make sense for an individual x. Therefore, it is in some sense
no longer possible to talk about the conditional probability for an individual x,
but only about a conditional probability for all outcomes of the random variable
X simultaneously. In our last example, the two random variables $Y_2$ and $Y_3$ on
which we conditioned were different, and therefore it should, after all, not come
as a surprise that the answers turn out to be different. □
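
The difference between the two ways of conditioning can be made visible in a
simulation. The sketch below is an illustration under assumptions: it takes
independent exponential random variables with $\lambda = 1$, and replaces the events
$\{X_1/X_2 = 1\}$ and $\{X_1 - X_2 = 0\}$ (which have probability 0) by thin windows of
half-width $\epsilon = 0.02$. The function name and all numerical values are hypothetical.

```python
import random


def conditional_means(lam=1.0, eps=0.02, n=400_000, seed=5):
    """Average of X1 + X2 given X1/X2 near 1, and given X1 - X2 near 0."""
    rng = random.Random(seed)
    near_ratio = []   # samples with |X1/X2 - 1| < eps
    near_diff = []    # samples with |X1 - X2| < eps
    for _ in range(n):
        x1 = rng.expovariate(lam)
        x2 = rng.expovariate(lam)
        total = x1 + x2
        if x2 > 0 and abs(x1 / x2 - 1.0) < eps:
            near_ratio.append(total)
        if abs(x1 - x2) < eps:
            near_diff.append(total)
    return sum(near_ratio) / len(near_ratio), sum(near_diff) / len(near_diff)
```

With $\lambda = 1$, the first average should be close to 2, the mean of the density
$\lambda^2 y_1 e^{-\lambda y_1}$, and the second close to 1, the mean of the density $\lambda e^{-\lambda y_1}$. The two
windows select different sets of outcomes, even though both shrink to the same
line $X_1 = X_2$.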



5.11 The Law of Large Numbers 

In Section 4.1 we proved a number of limit theorems for discrete random variables. 
It will come as no surprise that Theorem 4.1.1 and Theorem 4.1.4 are true also 
for continuous random variables, without any change in the statements. It is not 
so useful to repeat the proofs here completely. In fact, it is a very good exercise
to do this for yourself. In the proof of Theorem 4.1.1 you will need Chebyshev’s
inequality, Corollary 2.3.23.

♠ Exercise 5.11.1. Give a proof of Chebyshev’s inequality for continuous random
variables.

♠ Exercise 5.11.2. Prove Theorem 4.1.1 and Theorem 4.1.4 for continuous random
variables.

Example 5.11.3. (Numerical integration) Let $g : [0, 1] \to \mathbb{R}$ be a function with
$\int_0^1 g(x)\,dx < \infty$. We can use the law of large numbers to estimate $\int_0^1 g(x)\,dx$
numerically as follows.

Let $X_1, X_2, \ldots$ be independent and identically distributed, uniform (0, 1)
random variables. This is to say that their common density $f_X(x)$ satisfies
$f_X(x) = 1$ for $0 < x < 1$. Let $Y_i = g(X_i)$, for $i = 1, 2, \ldots$. We approximate
$\int_0^1 g(x)\,dx$ by

$$I_n = \frac{1}{n} \sum_{i=1}^{n} Y_i.$$

Now observe that $E(Y_i) = \int_0^1 g(x) f_X(x)\,dx = \int_0^1 g(x)\,dx$. It then follows from the
general weak law of large numbers that for all $\epsilon > 0$,

$$P\left(\left|I_n - \int_0^1 g(x)\,dx\right| > \epsilon\right) \to 0,$$

as $n \to \infty$. Hence, for n large enough, $I_n$ is with high probability a good
approximation of $\int_0^1 g(x)\,dx$. □
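
The estimator $I_n$ takes only a few lines of code. The following is a minimal sketch;
the function name, the seed and the test function $g(x) = x^2$ mentioned below are
illustrative assumptions.

```python
import random


def monte_carlo_integral(g, n, seed=0):
    """I_n = (1/n) * sum of g(X_i), with X_1, ..., X_n i.i.d. uniform on (0, 1)."""
    rng = random.Random(seed)
    return sum(g(rng.random()) for _ in range(n)) / n
```

For example, `monte_carlo_integral(lambda x: x * x, 200_000)` should be close to
$\int_0^1 x^2\,dx = 1/3$. The weak law of large numbers only says that large deviations
become unlikely as n grows; it gives no guarantee for a single run.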



5.12 Exercises 

Exercise 5.12.1. Suppose we choose a completely random point in the interval
$(-2, 1)$, and denote the distance to 0 by X. Show that X is a continuous random
variable, and compute its density and expectation.

Exercise 5.12.2. Suppose that X is a continuous random variable with density
f(x) = x^, for 0 < x < t, and f(x) = 0 elsewhere.

(a) Compute t. 

(b) Compute the expectation of X. 

(c) Compute P(X > 1). 

Exercise 5.12.3. Let X be a continuous random variable with $E(X) = 3/5$ and
density $f(x) = a + bx^2$, for $0 < x < 1$, and $f(x) = 0$ elsewhere. Compute a and b.

Exercise 5.12.4. Let X be a continuous random variable with density f(x) = cx^,
x ∈ (0, 1), and f(x) = 0 for other values of x.

(a) Compute c. 

(b) Compute E{X). 

(c) Compute P(X > ^). 

(d) Compute the variance of X. 

Exercise 5.12.5. Let X be a continuous random variable with density
$f(x) = c \sin x$, for $x \in (0, \pi)$, and $f(x) = 0$ elsewhere.

(a) Compute c. 

(b) Compute E{X). 

(c) Can you think of an example where this distribution arises?

Exercise 5.12.6. Suppose that $b > 0$ and $a < 0$, and let X be a continuous random
variable with density

$$f(x) = ax + b,$$

for x ∈ (0, and f(x) = 0 elsewhere.

(a) Show that this is a density for all a and b which satisfy ah^ = —2. 

(b) Compute $E(X)$ for these values of a and b.

(c) Compute $P(X > E(X))$.

Exercise 5.12.7. Let X have density f_X(x) = cx^(1 - x)^, for x ∈ [0, 1], and
f_X(x) = 0 elsewhere.

(a) Compute c.

(b) Compute $P(X > 1/2)$.

Exercise 5.12.8. Let Z be a standard normal random variable. Show that for $x > 0$

(a) $P(Z > x) = P(Z < -x)$;

(b) $P(|Z| > x) = 2P(Z > x)$.

Exercise 5.12.9. If X is an exponential random variable with parameter $\lambda$, and
$c > 0$, show that cX is exponential with parameter $\lambda/c$.

Exercise 5.12.10. Two people agreed to meet each other on a particular day, be- 
tween 5 and 6 pm. They arrive (independently) at a uniform time between 5 and 
6, and wait for 15 minutes. What is the probability that they meet each other? 

Exercise 5.12.11. Let X and Y have joint density f(x, y) = for x, y > 0. Are
X and Y independent? Find the marginal distributions of X and Y and compute
their covariance.

Exercise 5.12.12. Let X and Y be independent random variables with an
exponential distribution with parameters $\mu$ and $\lambda$. Let $U = \min\{X, Y\}$ and
$V = \max\{X, Y\}$. Finally, let $W = V - U$.

(a) Compute $P(X \le Y) = P(U = X)$.

(b) Show that U and W are independent.

Exercise 5.12.13. Let X have an exponential distribution with parameter $\lambda$. Show
that

$$P(X > s + x \mid X > s) = P(X > x).$$

Explain why this is called the lack of memory property of the exponential
distribution.

Exercise 5.12.14. The random variable X has a double-exponential distribution
with parameter $\lambda > 0$ if its density is given by

$$f(x) = \tfrac{1}{2} \lambda e^{-\lambda |x|},$$

for all x. Show that $E(X) = 0$ and $\mathrm{var}(X) = 2/\lambda^2$.

Exercise 5.12.15. If X has the double-exponential distribution of the previous
exercise, show that $|X|$ has an exponential distribution with parameter $\lambda$.

Exercise 5.12.16. Let X be a random variable with finite expectation and variance.
Show that

$$\mathrm{var}(X) \le E(X - a)^2,$$

for all $a \in \mathbb{R}$.

Exercise 5.12.17. When we enter a shop, the number of customers ahead of us
has a Poisson distribution with parameter $\lambda$. The service time of the customers is
exponential with parameter $\lambda$, and all random variables involved are independent.
Compute the expected waiting time.

Exercise 5.12.18. Let (X, Y) have joint density $f(x,y) = e^{-y}$, for $0 < x < y$, and
$f(x,y) = 0$ elsewhere.

(a) Compute the marginal density of Y.

(b) Show that $f_{X|Y}(x \mid y) = 1/y$, for $0 < x < y$.

(c) First compute $E(X \mid Y = y)$ and use this to compute $E(X)$.

Exercise 5.12.19. Let (X, Y) have joint density f(x, y) = for x, y > 0
and f(x, y) = 0 elsewhere.

(a) Compute the marginal distributions of X and Y.

(b) Show that XY has an exponential distribution with parameter 1.

(c) Compute $E(X \mid Y = y)$.

Exercise 5.12.20. Let (X, Y) have joint density $f(x,y) = xe^{-x-y}$, when $x, y > 0$,
and $f(x,y) = 0$ elsewhere. Are X and Y independent?

Exercise 5.12.21. Let X and Y be independent exponentially distributed random
variables with parameter 1. Let $Z = Y/X$.

(a) Compute the distribution function of (X, Z).

(b) Are X and Z independent?

(c) Compute $E(Z \mid X = x)$.

Exercise 5.12.22. Let U and V be independent and uniformly distributed on (0, 1).
Let $X = U + V$ and $Y = UV$.

(a) Compute the density of (X, Y), X and Y.

(b) Are X and Y independent?

Exercise 5.12.23. Let X be a (0, uniform random variable. Find the density of
$Y = \sin X$.



Exercise 5.12.24. Let X and Y have the bivariate normal distribution with density

$$f(x,y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left(-\frac{x^2 - 2\rho xy + y^2}{2(1 - \rho^2)}\right).$$

(a) Show that X and $Z = (Y - \rho X)/(1 - \rho^2)^{1/2}$ are independent standard normal
random variables.

(b) Deduce from (a) that

$$P(X > 0, Y > 0) = \frac{1}{4} + \frac{1}{2\pi} \arcsin \rho.$$

Exercise 5.12.25. Let X, Y and Z be independent (0, 1) uniform random variables.

(a) Find the joint density of XY and $Z^2$.

(b) Show that $P(XY < Z^2) = \tfrac{5}{9}$.

Exercise 5.12.26. Show that the density of the sum of two independent standard
normal random variables is given by

$$f(x) = \frac{1}{2\sqrt{\pi}} e^{-x^2/4}.$$

Exercise 5.12.27. Let X be a continuous random variable with distribution
function F. Show that

(a) $F(X)$ is uniformly distributed on (0, 1);

(b) $-\log F(X)$ is exponentially distributed.

Exercise 5.12.28. Let F be a strictly increasing, continuous distribution function,
and let U be a uniform (0, 1) distributed random variable. Show that $F^{-1}(U)$ is
a random variable with distribution function F.

Exercise 5.12.29. Suppose that X has a Cauchy distribution. Show that 1/X also 
has a Cauchy distribution. 

Exercise 5.12.30. Let X and Y be independent with a standard normal distribu- 
tion. Show that X/Y has a Cauchy distribution. 

Exercise 5.12.31. Let (X, Y) have joint density f(x, y) = for 0 < y < x,
and f(x, y) = 0 elsewhere. Compute cov(X, Y).

Exercise 5.12.32. Compute the conditional distribution of Y given X = x in the
previous exercise and compute $E(Y \mid X = x)$.

Exercise 5.12.33. Suppose X has an exponential distribution with parameter
$\lambda > 0$. Given X = x, we let Y be a uniform point in $(0, x)$. Compute $E(Y)$.

Exercise 5.12.34. Let X and Y be independent and uniformly distributed on (0, 1).
Compute the density of $X + Y$.

Exercise 5.12.35. Show that for X and Y independent and standard normal
distributed, ^(X + Y) has a normal distribution with parameters 0 and

Exercise 5.12.36. Let X and Y be independent standard normal random variables,
and let Z = ^(X^ + Y^). Compute the density of Z.

Exercise 5.12.37. Let X and Y be independent random variables with the same
density f(x) = a(1 + Show that arctan(Y/X) has a uniform distribution
on $(-\pi/2, \pi/2)$.

Exercise 5.12.38. Let $X_1$, $X_2$ and $X_3$ be independent uniform (0, 1) random
variables. What is the probability that we can form a triangle with three sticks of
length $X_1$, $X_2$ and $X_3$?

Exercise 5.12.39. (Buffon’s needle) A plane is ruled by the lines $y = n$ ($n \in \mathbb{Z}$) and
a needle of unit length is thrown randomly on to the plane. What is the probability
that it intersects a line? In order to answer this question, we have to make a few
assumptions. Let Z be the distance from the centre of the needle to the nearest
line beneath it, and let $\Theta$ be the angle, modulo $\pi$, made by the needle and the
x-axis.

(a) Explain why it is reasonable to assume that Z is uniform on (0, 1), $\Theta$ is uniform
on $(0, \pi)$, and that Z and $\Theta$ are independent.

(b) Show that under the assumptions in (a), the vector $(Z, \Theta)$ has joint density

$$f(z, \theta) = \frac{1}{\pi}$$

for all appropriate z and $\theta$.

(c) Show that the needle intersects a line if and only if

$$(Z, \Theta) \in \left\{(z, \theta) : z \le \tfrac{1}{2} \sin\theta \ \text{ or } \ 1 - z \le \tfrac{1}{2} \sin\theta\right\}.$$

(d) Show that the probability that the needle intersects a line is equal to $2/\pi$.

(e) Explain how you can use (d) to approximate $\pi$ numerically. (Think of the law
of large numbers.)

Exercise 5.12.40. Let X be uniform on (0, 1). Compute $P(X \le x \mid X^2 = y)$.



Infinitely Many Repetitions



In the Intermezzo, we found that an infinitely fine operation like choosing a point 
on a line segment can not be captured with countable sample spaces, and hence 
we extended our notion of a sample space to be able to deal with that kind of 
operation. 

However, there is yet another problem, namely how to deal with infinitely 
many repetitions of an experiment. For instance, we might want to talk about an 
infinite sequence of independent coin flips. In this chapter we look at the problem of 
how to model and construct an infinite sequence of independent random variables. 

Maybe you have the impression that infinitely many coin flips should not be 
that important. Granted, we will never be able to perform infinitely many such 
flips, and so far we have not really needed to do this, not even in the laws of
large numbers in Sections 1.6 and 4.1. In these laws of large numbers, we first took
n random variables, then computed a probability, and finally took the limit for
$n \to \infty$. There was no need for infinitely many independent random variables.

Sometimes, however, we want to take the limit first and after that compute 
a probability. So, for instance, we would like to talk about the probability that 
the average outcomes converge to an expectation. And sometimes it is not clear 
in advance how many independent random variables with a certain distribution 
we need. For situations like this, we need to define infinitely many independent 
random variables on one and the same probability space. In the first two sections 
of this chapter, we shed some light on this. If you are not particularly interested 
in these details, it is safe to continue with Section 6.3 with the knowledge that
infinitely many independent random variables $X_1, X_2, \ldots$ with any distribution
can indeed be defined on some probability space. In the subsequent sections, we
discuss some applications, notably strong laws of large numbers, random walk and
branching processes.



6.1 Infinitely Many Coin Flips and
Random Points in (0, 1]



In this section we will show that it is, to some extent, possible to define infinitely
many coin flips on a very familiar sample space, namely on $\Omega = (0, 1]$. We start
by recalling from Chapter 5 one aspect of the selection of a completely random
point from $\Omega$, because we need this for the coin flips: the probability that such a
random point is chosen in

$$A = \bigcup_{i=1}^{\infty} (a_i, b_i],$$

where the intervals $(a_i, b_i]$ are pairwise disjoint, was defined as

$$P(A) = \sum_{i=1}^{\infty} (b_i - a_i). \tag{6.1}$$

How does this relate to coin flips? Well, as before, we associate with each
$\omega \in \Omega$ its binary expansion:

$$\omega = \sum_{n=1}^{\infty} \frac{d_n(\omega)}{2^n} = .d_1(\omega)d_2(\omega)\cdots$$

If $\omega$ has more than one such expansion, we take the one with infinitely many 1s.
So for instance 1/2 can be written as $.1000000\cdots$ but we make the convention to
use $1/2 = .01111111\cdots$. Now consider the sequence

$$(d_1(\omega), d_2(\omega), \ldots).$$



We claim that this sequence behaves, to some extent, as if it resulted from an
infinite sequence of flips of a fair coin. To see what we mean by this, note that

$$\{\omega : d_i(\omega) = u_i, i = 1, \ldots, n\} = \left(\sum_{i=1}^{n} \frac{u_i}{2^i},\ \sum_{i=1}^{n} \frac{u_i}{2^i} + \frac{1}{2^n}\right], \tag{6.2}$$

where the $u_i$’s take values in 0 and 1. To understand (6.2), one can perhaps first
look at those $\omega$s which satisfy $d_1(\omega) = 0$. The set of $\omega$s that satisfy $d_1(\omega) = 0$ form the
set $(0, \tfrac{1}{2}]$. Among these $\omega$s, only those in $(\tfrac{1}{4}, \tfrac{1}{2}]$ have $d_2(\omega) = 1$, etcetera.

But now we see that we can assign a probability to the set
$\{\omega : d_i(\omega) = u_i, i = 1, \ldots, n\}$, using (6.1) above. Indeed, the length of the interval at the right-hand
side of (6.2) is $2^{-n}$, and we conclude that

$$P(\omega : d_i(\omega) = u_i, i = 1, \ldots, n) = 2^{-n}.$$

Interpreting $d_i(\omega)$ as the outcome of the ith flip of a coin, we see, indeed, that the
probability of a given sequence of outcomes of length n is equal to $2^{-n}$, which is
what our probabilistic intuition requires.

This means that we have defined, with a very simple sample space, a
mathematical model that represents infinitely many coin flips. The models for coin flips
in earlier chapters were only able to model finitely many coin flips.
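
The correspondence between a random point and coin flips can be tried out on a
computer. The sketch below is an illustration only: it extracts the first binary digits
of a point in (0, 1], using the convention of the expansion with infinitely many 1s,
and tabulates how often each pattern of the first three digits occurs for uniformly
chosen points. The function names, the number of trials and the seed are
assumptions.

```python
import random
from collections import Counter


def binary_digits(omega, n):
    """First n binary digits of omega in (0, 1], expansion with infinitely many 1s."""
    digits = []
    for _ in range(n):
        omega *= 2
        if omega > 1:   # a tie (omega == 1) gets digit 0, followed by the tail .111...
            digits.append(1)
            omega -= 1
        else:
            digits.append(0)
    return tuple(digits)


def pattern_frequencies(n_digits=3, trials=80_000, seed=7):
    """Relative frequency of each digit pattern for uniformly chosen points in (0, 1]."""
    rng = random.Random(seed)
    counts = Counter(
        binary_digits(1.0 - rng.random(), n_digits) for _ in range(trials)
    )
    return {pattern: c / trials for pattern, c in counts.items()}
```

Here `binary_digits(0.5, 4)` returns `(0, 1, 1, 1)`, in line with the convention
$1/2 = .0111\cdots$, and each of the eight patterns of length 3 should occur with relative
frequency close to $2^{-3}$.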

Infinitely many events are said to be independent if each finite subcollection
is independent. In the present context this corresponds to independence of any
collection $\{\{d_i = u_i\}, i \in I\}$, for any finite index set I. The next exercise shows that
indeed, different coin flips are independent.

♠ Exercise 6.1.1. Show that

$$P(d_i = u_i, i \in I) = \prod_{i \in I} P(d_i = u_i),$$

for any finite index set I and numbers $u_i \in \{0, 1\}$.

We have performed a most remarkable construction. We simultaneously con- 
structed a model for picking a point randomly from (0, 1], and a model for infinitely 
many coin flips. Many interesting events do have probabilities in this simultaneous 
model. 



Example 6.1.2. Consider the event that we see k heads among the first n flips. We
know from Example 1.2.8 that the probability of this event should be $\binom{n}{k} 2^{-n}$. In
the present context we can rederive this result as follows. The event

$$\left\{\omega : \sum_{i=1}^{n} d_i(\omega) = k\right\}$$

of seeing k heads among the first n flips is the union of those intervals in (6.2) that
have k of the $u_i$’s equal to 1 and $n - k$ equal to 0. There are $\binom{n}{k}$ such intervals,
and each of them has length $2^{-n}$, which leads to the required probability. □
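
The counting argument of this example can be reproduced exactly for small n. The
sketch below, with hypothetical function names, builds the interval of (6.2) for every
0/1 pattern using exact fractions and adds up the lengths of those with k ones.

```python
from fractions import Fraction
from itertools import product
from math import comb


def interval_for(pattern):
    """Endpoints (left, right) of the interval (left, right] in (6.2) for a 0/1 pattern."""
    n = len(pattern)
    left = sum(Fraction(u, 2 ** i) for i, u in enumerate(pattern, start=1))
    return left, left + Fraction(1, 2 ** n)


def prob_k_heads(n, k):
    """Total length of the intervals whose pattern has exactly k ones."""
    total = Fraction(0)
    for pattern in product((0, 1), repeat=n):
        if sum(pattern) == k:
            left, right = interval_for(pattern)
            total += right - left
    return total
```

For instance, `interval_for((0, 1))` gives the endpoints 1/4 and 1/2, and
`prob_k_heads(5, 2)` equals `Fraction(comb(5, 2), 2 ** 5)`.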

Can we assign a probability to all reasonable subsets of $\Omega$ now? The answer
is no. There are very interesting and important subsets which cannot be written
as such a countable union. As an example, consider the set

$$L = \left\{\omega : \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} d_i(\omega) = \frac{1}{2}\right\}. \tag{6.3}$$

This set cannot be expressed as a union of countably many intervals.

♠ Exercise 6.1.3. Can you prove this?

In an attempt to assign probabilities to a larger collection of sets, hopefully
including L, we make the following definition.

Definition 6.1.4. A subset A of $\Omega$ is called small if for each $\epsilon > 0$, there exists a
finite or countable collection $I_1, I_2, \ldots$ of intervals (possibly overlapping) satisfying

$$A \subset \bigcup_k I_k \tag{6.4}$$

and

$$\sum_k P(I_k) < \epsilon. \tag{6.5}$$

It is reasonable to assign probability 0 to small sets, and this is precisely
what we do now.

Definition 6.1.5. For any small set A, we define $P(A) = 0$, and $P(A^c) = 1$.

In Section 6.3 below we shall prove that $L^c$ is small, and hence, $P(L) = 1$.



6.2 A More General Approach to Infinitely Many 
Repetitions 

In the last section, we formulated a construction of infinitely many independent 
coin flips. Beautiful as the construction was, it was also rather limited in its scope. 
In this section we sketch a very simple construction which allows for infinitely 
many independent random variables with arbitrary distributions. For this, we 
need a new sample space. 

Consider $\Omega = \mathbb{R}^{\infty}$, the set whose elements can be written as

$$\omega = (\omega_1, \omega_2, \omega_3, \ldots),$$

with $\omega_i \in \mathbb{R}$, for all i. Also let, for all $i = 1, 2, \ldots$, $f_i$ be a density on $\mathbb{R}$.

Given this sequence of densities, how can we assign probabilities to suitable
subsets of $\Omega$? The idea is that we first only look at so-called finite-dimensional
subsets A. A subset $A \subset \Omega$ is called finite-dimensional if we can decide whether
or not $\omega \in A$ by only looking at a (nonrandom) finite set of indices. Here is an
example.

Example 6.2.1. The set {ω ∈ Ω : < 5} is a finite-dimensional set, since
we need only look at the second and third coordinates to decide whether or not
$\omega \in A$. The set $\{\omega \in \Omega : \omega_i = 0 \text{ for all even indices } i\}$ is not finite-dimensional. □

Since any finite-dimensional event depends on only finitely many indices,
there is, for any such event, an index n such that A depends only on the indices
$1, \ldots, n$. Hence, any finite-dimensional event A can, for some n, be written as

$$A = A_n \times \prod_{i=n+1}^{\infty} \mathbb{R},$$

where $A_n \subset \mathbb{R}^n$. For this finite-dimensional event, we define the probability $P(A)$
of A as

$$P(A) = \int \cdots \int_{A_n} f_1(x_1) \cdots f_n(x_n)\,dx_1 \cdots dx_n.$$

In this way, we assign a probability to any finite-dimensional subset A.

Example 6.2.2. Let, for all i, $f_i$ be the uniform density on (0, 1), that is, $f_i(x) = 1$
for $0 < x < 1$, and $f_i(x) = 0$ elsewhere. Consider the set $A \subset \Omega$ defined by

Then A is finite-dimensional, and P(A) = 1 × | = |. □

Example 6.2.3. In the same situation as the previous example, let B be the set

$$B = \{\omega \in \Omega : \omega_1 + \omega_2 \le 1\},$$

then B is finite-dimensional, and P(B) = □

We can extend the collection of subsets that receive a probability in two 
ways, as follows. 

1. If A^,A^,. . . are disjoint finite-dimensional sets with union A, we define 

P{A) = Y,P{A% 

i 

2. We define a set to be small precisely as in Definition 6.1.4, with intervals 
replaced by finite-dimensional sets. As before, we assign probability 0 to all 
small sets, and probability 1 to the complement of a small set. 

All subsets that received a probability by this procedure are called events^ 
just as before. In the forthcoming sections, all sets under consideration will be 
events, but we will not mention this explicitly each time. 

A random variable on $\Omega$ is now simply defined as a mapping $X$ from
$\Omega$ to $\mathbb{R}$ such that for all $-\infty \le a \le b \le \infty$,
the set

\[ \{\omega \in \Omega : a \le X(\omega) \le b\} \]

is an event. With this construction, we can now define infinitely many
independent random variables $X_1, X_2, \ldots$ as follows:

\[ X_i(\omega) = \omega_i, \]

for $i = 1, 2, \ldots$ It is clear that $X_i$ is a random variable with
density $f_i$, for all $i$.

It is also clear that

\[ P(X_{i_1} \le a_1, \ldots, X_{i_k} \le a_k) = \prod_{j=1}^{k} P(X_{i_j} \le a_j), \]

and hence the $X_i$'s are independent.

Exercise 6.2.4. Show that we can also define an infinite sequence of independent
random variables with an arbitrary discrete distribution.



6.3 The Strong Law of Large Numbers 



In this section we discuss three approaches to what is called the strong law of 
large numbers. The phrase strong refers to the fact that in these laws, the limit 
is inside the probability, as opposed to the laws of large numbers that we have 
seen so far. The first strong law goes back to Theorem 1.6.1, and deals with the 
situation in which the random variables can only take the values 0 and 1 with 
equal probability. 

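Consider independent random variables $X_1, X_2, \ldots$, all with the same
distribution given by $P(X_i = 1) = P(X_i = 0) = \frac{1}{2}$, for all $i$.
In (1.5) it was shown (in the current notation) that

\[ P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{2} \right| > \epsilon \right) \le 2 e^{-n\epsilon^2}, \tag{6.6} \]

where the factor 2 comes from the fact that we need to consider deviations above
and below $\frac{1}{2}$. We use this formula with a surprising trick: we take
$\epsilon$ depending on $n$, namely $\epsilon = n^{-1/4}$. The reason for this
choice becomes clear in a moment. We obtain

\[ P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{2} \right| > n^{-1/4} \right) \le 2 e^{-\sqrt{n}}. \tag{6.7} \]

Define

\[ F = \left\{ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{2} \right\}. \]

We want to show that $F^c$ is small, in the sense of Section 6.2. Denote the
event in (6.7) by $A_n$. A little thought reveals that

\[ F^c \subset \bigcup_{n=m}^{\infty} A_n \tag{6.8} \]

for all $m$: if $\omega$ is in none of the sets $A_n$ with $n \ge m$, then the
average of the first $n$ coordinates differs from $\frac{1}{2}$ by at most
$n^{-1/4}$ for all $n \ge m$, and hence $\omega \in F$. Since
$P(A_n) \le 2e^{-\sqrt{n}}$, the sum $\sum_n P(A_n)$ is finite (this is the
reason for our choice of $\epsilon$ above), and hence
$\sum_{n=m}^{\infty} P(A_n)$ can be made arbitrarily small by taking $m$ large.
This implies by (6.8) that $F^c$ is small, and hence that $P(F) = 1$.
We have proved the following result.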

Theorem 6.3.1. (First strong law of large numbers) Let $X_1, X_2, \ldots$ be
independent random variables, all with the same distribution given by
$P(X_i = 1) = P(X_i = 0) = \frac{1}{2}$, for all $i$. Then

\[ P\left( \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{2} \right) = 1. \]

With some extra work we can generalise Theorem 6.3.1 considerably, as follows:

Theorem 6.3.2. (Second strong law of large numbers) Let $X_1, X_2, \ldots$ be
independent and identically distributed random variables, such that
$E(X_1^4) < \infty$. We write $E(X_i) = \mu$. Then

\[ P\left( \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mu \right) = 1. \tag{6.9} \]



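Proof. We first assume that $\mu = 0$, and we denote the event in (6.9) by $M$.
With Markov's inequality, we can show, writing $S_n = X_1 + \cdots + X_n$, that

\[ P\left( \left| \frac{S_n}{n} \right| > \epsilon \right) = P(S_n^4 > n^4 \epsilon^4) \le \frac{E(S_n^4)}{n^4 \epsilon^4}. \]

We now write

\begin{align*}
E(S_n^4) &= E\left( \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{l=1}^{n} X_i X_j X_k X_l \right) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{l=1}^{n} E(X_i X_j X_k X_l) \\
&= \sum_{i=1}^{n} E(X_i^4) + 3 \sum_{i \ne j} E(X_i^2) E(X_j^2) \\
&= n E(X_1^4) + 3n(n-1) E(X_1^2) E(X_2^2) \quad \text{(see explanation below)} \\
&\le c n^2,
\end{align*}

for some positive constant $c$. To understand the middle two lines, note that if a
certain $X_i$ appears in isolation in the four-fold product, then the corresponding
expectation is 0, by independence and $E(X_i) = 0$. There are $n$ ways to get a
term of the form $X_i^4$ and $\binom{4}{2} \binom{n}{2} = 3n(n-1)$ ways to obtain
a term of the form $X_i^2 X_j^2$ for $i \ne j$. All other terms are zero since
they contain at least one $X_i$ in isolation.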

Exercise 6.3.3. Check these last combinatorial statements.
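For small $n$ the counting claim can also be checked by brute force. The following sketch enumerates all index quadruples $(i, j, k, l)$ and sorts them by the pattern of repeated indices; the function name and the labels are hypothetical, and the enumeration is no substitute for the combinatorial argument asked for in the exercise.

```python
from collections import Counter
from itertools import product

def classify_index_tuples(n):
    """Count the quadruples (i, j, k, l) in {0, ..., n-1}^4 by repetition pattern."""
    counts = Counter()
    for indices in product(range(n), repeat=4):
        multiplicities = sorted(Counter(indices).values())
        if multiplicities == [4]:
            counts["fourth_power"] += 1      # term of the form X_i^4
        elif multiplicities == [2, 2]:
            counts["two_squares"] += 1       # term of the form X_i^2 X_j^2, i != j
        else:
            counts["isolated_factor"] += 1   # some index occurs exactly once
    return counts

for n in range(1, 6):
    counts = classify_index_tuples(n)
    print(n, counts["fourth_power"], counts["two_squares"], 3 * n * (n - 1))
```

Every pattern other than the first two contains an index of multiplicity one, which is exactly the "isolated" case in the proof.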



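It now follows that

\[ P\left( \left| \frac{S_n}{n} \right| > \epsilon \right) \le \frac{c}{n^2 \epsilon^4}. \tag{6.10} \]

So far, $\epsilon$ was independent of $n$. But now we choose
$\epsilon_n = n^{-1/8}$. The reason for this somewhat strange choice is the fact
that the series $\sum_n 1/(n^2 \epsilon_n^4)$ converges. Hence, writing $A_n$
for the event in (6.10) with $\epsilon = \epsilon_n$, we now have that

\[ \sum_{n=1}^{\infty} P(A_n) < \infty. \]

If, for some $m$, $\omega$ is not in $A_n$ for any $n \ge m$, then
$|S_n(\omega)/n| \le \epsilon_n$ for all $n \ge m$, and since
$\epsilon_n \to 0$ as $n \to \infty$, this implies that $\omega \in M$. We
conclude that

\[ \bigcap_{n=m}^{\infty} A_n^c \subset M \]

for all $m$, which is the same as

\[ M^c \subset \bigcup_{n=m}^{\infty} A_n. \]

The $A_n$'s form a collection of finite-dimensional sets, and since
$\sum_n P(A_n) < \infty$, we can, for any $\epsilon > 0$, choose $m$ so large
that $\sum_{n=m}^{\infty} P(A_n) < \epsilon$. This means that $M^c$ is small.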

Note that we have assumed $\mu = 0$ so far. To prove the general statement,
assume that $E(X_i) = \mu$, and define $Y_i = X_i - \mu$, for $i = 1, 2, \ldots$
Then $E(Y_i) = 0$, and according to what we have already proved, it follows that

\[ \frac{1}{n} \sum_{i=1}^{n} Y_i \to 0 \]

with probability 1, as $n \to \infty$. This means that

\[ \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \to 0 \]

with probability 1, and we have proved the general result.



□ 
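The statement of the strong law can be illustrated (not proved) by following a single simulated sequence of fair coin flips and printing its running averages. This is a minimal sketch; the seed, the number of flips and the function name are arbitrary choices made for illustration.

```python
import random

def running_means(num_flips, seed=0):
    """Running averages (X_1 + ... + X_n) / n along one simulated path of fair coin flips."""
    rng = random.Random(seed)
    total = 0
    means = []
    for n in range(1, num_flips + 1):
        total += rng.getrandbits(1)   # one fair coin flip: 0 or 1
        means.append(total / n)
    return means

means = running_means(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(means[n - 1], 4))
```

The law concerns the limit along almost every path, so a single finite path can only suggest the behaviour: the printed averages should settle near $\frac{1}{2}$.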



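We can also approach the strong law via the setup of Section 6.1. Recall the
set $L$ which was defined as

\[ L = \left\{ \omega \in (0, 1] : \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} d_i(\omega) = \frac{1}{2} \right\}. \tag{6.11} \]

Theorem 6.3.4. (Third strong law of large numbers) For $L$ as in (6.11), we have
that $L^c$ is a small set, and hence

\[ P(L) = 1. \]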

In fact, the proof is very similar to the proof of Theorem 6.3.2. It is instruc- 
tive, however, to see how a proof in the setup of Section 6.1 proceeds, and we 
therefore give some details. We first need a preliminary lemma.

Lemma 6.3.5. Let $0 = x_0 < x_1 < \cdots < x_{k-1} < x_k = 1$, and let
$f : (0, 1] \to \mathbb{R}$ be a non-negative function such that $f$ is constant
between consecutive $x_j$'s: $f(\omega) = c_j$, for all
$\omega \in (x_{j-1}, x_j]$. Then, for $\alpha > 0$, the set
$\{\omega : f(\omega) > \alpha\}$ is a finite union of intervals and

\[ P(\omega : f(\omega) > \alpha) \le \frac{1}{\alpha} \int_0^1 f(\omega) \, d\omega. \]

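Proof. The set in question is simply the union of those intervals
$(x_{j-1}, x_j]$ for which $c_j > \alpha$. Hence

\begin{align*}
\alpha P(\omega : f(\omega) > \alpha) &= \alpha \sum_{j : c_j > \alpha} (x_j - x_{j-1}) \\
&\le \sum_{j : c_j > \alpha} c_j (x_j - x_{j-1}) \\
&\le \sum_{j=1}^{k} c_j (x_j - x_{j-1}) = \int_0^1 f(\omega) \, d\omega,
\end{align*}

where the last inequality uses that no $c_j$ is negative. □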

Exercise 6.3.6. Do you see the similarity between this lemma and Markov's
inequality?

Proof of Theorem 6.3.4. It is convenient to change the digits from 0s and 1s to
+1s and −1s. Hence we define $r_n(\omega) = 2d_n(\omega) - 1$, that is, if
$d_n(\omega) = 1$, then $r_n(\omega) = 1$, and if $d_n(\omega) = 0$, then
$r_n(\omega) = -1$. Now consider the partial sums

\[ s_n(\omega) = \sum_{i=1}^{n} r_i(\omega). \]

Clearly, $L$ and

\[ \left\{ \omega \in (0, 1] : \lim_{n \to \infty} \frac{s_n(\omega)}{n} = 0 \right\} \]

define the same set, and it suffices to show that $L^c$ is small. Applying Lemma
6.3.5 to $f(\omega) = s_n^4(\omega)$ and $\alpha = n^4 \epsilon^4$, we obtain

\[ P(\omega : |s_n(\omega)| > n\epsilon) \le \frac{1}{n^4 \epsilon^4} \int_0^1 s_n^4(\omega) \, d\omega. \tag{6.12} \]

Clearly, we have

\[ s_n^4(\omega) = \sum r_i(\omega) r_j(\omega) r_k(\omega) r_l(\omega), \]

where all four indices range from 1 to $n$. The terms inside the sum can be of
various forms.

If the four indices are all the same, the outcome is clearly 1, and the same is
true if they are pairwise the same, that is, of the form
$r_i^2(\omega) r_j^2(\omega)$ for $i \ne j$. There are $n$ occurrences of the
first type, and $3n(n-1)$ of the second: indeed, there are $n$ choices for the
first index $i$, three ways to match it with $j$, $k$ or $l$, and then $n - 1$
choices for the value common to the remaining two indices.

If the product is of the form $r_k^2(\omega) r_j(\omega) r_i(\omega)$ for
different $i$, $j$ and $k$, then this reduces to $r_j(\omega) r_i(\omega)$.
Assume without loss of generality that $i < j$. Now observe that on a dyadic
interval of size $2^{-(j-1)}$, $r_i$ is constant and $r_j$ has value $-1$ on the
left half, and $+1$ on the right half. The product $r_i r_j$ therefore integrates
to 0 over each of the dyadic intervals of size $2^{-(j-1)}$, and we find that

\[ \int_0^1 r_k^2(\omega) r_j(\omega) r_i(\omega) \, d\omega = 0, \]

if $k$, $i$ and $j$ are all different.

Since $r_i^3(\omega) r_j(\omega) = r_i(\omega) r_j(\omega)$, also

\[ \int_0^1 r_i^3(\omega) r_j(\omega) \, d\omega = 0, \]

if $i \ne j$.

Finally, a similar argument shows that

\[ \int_0^1 r_i(\omega) r_j(\omega) r_k(\omega) r_l(\omega) \, d\omega = 0 \]

if $i$, $j$, $k$ and $l$ are all different.

Exercise 6.3.7. Prove this last statement.

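Putting everything together now gives

\begin{align*}
\int_0^1 s_n^4(\omega) \, d\omega &= \sum_{i=1}^{n} \int_0^1 r_i^4(\omega) \, d\omega + 3 \sum_{i \ne j} \int_0^1 r_i^2(\omega) r_j^2(\omega) \, d\omega \\
&= n + 3n(n-1) \le 3n^2.
\end{align*}

It now follows from (6.12) that

\[ P\left( \omega : \left| \frac{s_n(\omega)}{n} \right| > \epsilon \right) \le \frac{3}{n^2 \epsilon^4}. \]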



The proof is now finished exactly as the proof of Theorem 6.3.2.

Exercise 6.3.8. Make sure that you agree with this last statement.

□

6.4 Random Walk Revisited

In Chapter 3 we discussed random walk, defined as

\[ S_n = \sum_{i=1}^{n} X_i, \]

where $X_1, \ldots, X_n$ are independent random variables, all with the same
distribution given by $P(X_i = 1) = P(X_i = -1) = \frac{1}{2}$. In the current
context, we can investigate whether the random walk ever returns to its starting
point 0. To this end, we study the event

\[ A = \{S_n \ne 0 \text{ for all } n > 0\} \]

that the walk never returns.

The following theorem tells us that A has probability 0, in other words, the prob- 
ability that the random walk returns to 0 is equal to 1. This is called recurrence 
of the random walk. 

Theorem 6.4.1. (Recurrence of the random walk) We have P{A) = 0. 

Proof. The proof is very easy, given what we have already done in Chapter 3.
Indeed, in the proof of Lemma 3.1.14 we showed that

\[ P(S_1 S_2 \cdots S_{2m} \ne 0) = \binom{2m}{m} \left( \frac{1}{2} \right)^{2m}, \tag{6.13} \]

and in the proof of Theorem 3.2.3 we showed that the right-hand side of (6.13)
tends to 0 as $m \to \infty$. Since

\[ A \subset \{S_1 S_2 \cdots S_{2m} \ne 0\} \]

for all $m$, this means that $A$ is small, and therefore has probability 0. □
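The decay of the no-return probabilities can be made concrete for small $m$ by enumerating all paths. The sketch below assumes that (6.13) is the classical identity $P(S_1 S_2 \cdots S_{2m} \ne 0) = \binom{2m}{m} 2^{-2m}$ and compares it with a brute-force count; the function name is illustrative.

```python
from itertools import product
from math import comb

def prob_no_return(m):
    """P(S_1 ... S_{2m} != 0), by enumerating all 2^(2m) equally likely paths."""
    paths_without_return = 0
    for steps in product((-1, 1), repeat=2 * m):
        position = 0
        for step in steps:
            position += step
            if position == 0:
                break                      # the walk returned to 0
        else:
            paths_without_return += 1      # inner loop finished without a return
    return paths_without_return / 2 ** (2 * m)

for m in range(1, 7):
    print(m, prob_no_return(m), comb(2 * m, m) / 4 ** m)
```

The enumeration is exponential in $m$ and is only meant for very small values; the proof itself relies on the limit of the right-hand side.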

6.5 Branching Processes 

In this section we discuss a few aspects of branching processes. A branching process 
is a simple model for reproduction and can be described as follows. 

Suppose that a population evolves in generations, and denote by $Z_n$ the
number of members of the $n$th generation. Each member of the $n$th generation
gives birth to a random number of children, which will be members of the
$(n+1)$th generation. We make two basic assumptions about the number of children:

1. The numbers of children of different members of the population are
independent of each other.

2. The numbers of children of different members of the population all have the
same distribution.

Assuming that $Z_0 = 1$, we are interested in the random variables
$Z_1, Z_2, Z_3, \ldots$ The description of the process can be formulated as
follows:

\[ Z_{n+1} = X_1 + X_2 + \cdots + X_{Z_n}, \tag{6.14} \]

where the $X_i$'s are independent random variables representing the number of
children of a particular member of the population. Indeed, each member of the
$n$th generation begets a random number of children, and therefore the number of
members of the $(n+1)$th generation is given as in (6.14). Note that the number
of $X_i$'s in (6.14) is random. We call the common distribution of the $X_i$'s
the offspring distribution.

It turns out that a branching process can be well studied with the help of 
generating functions, which were defined in Section 2.6. In the current situation 
we have to deal with a sum of independent random variables of random length. 
To this end, we state the following lemma. 

Lemma 6.5.1. Let $X_1, X_2, \ldots$ be a sequence of independent identically
distributed random variables taking values in $\mathbb{N}$ and with common
generating function $G_X$. Let $N$ be a random variable, independent of the
$X_i$'s, also taking values in $\mathbb{N}$, with generating function $G_N$.
Then the sum

\[ S = X_1 + X_2 + \cdots + X_N \]

has generating function given by

\[ G_S(s) = G_N(G_X(s)). \]

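Proof. We write

\begin{align*}
G_S(s) &= E(s^S) \\
&= \sum_{n=0}^{\infty} E(s^S \mid N = n) P(N = n) \\
&= \sum_{n=0}^{\infty} E(s^{X_1 + \cdots + X_n}) P(N = n) \\
&= \sum_{n=0}^{\infty} E(s^{X_1}) \cdots E(s^{X_n}) P(N = n) \\
&= \sum_{n=0}^{\infty} (G_X(s))^n P(N = n) = G_N(G_X(s)). \qquad \Box
\end{align*}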



Writing $G_n$ for the generating function of $Z_n$, Lemma 6.5.1 and (6.14)
together imply that

\[ G_{n+1}(s) = G_n(G_1(s)), \]

and iteration of this formula implies that

\[ G_n(s) = G_1(G_1(\cdots(G_1(s))\cdots)), \tag{6.15} \]

the $n$-fold iteration of $G_1$. Note that $G_1$ is just the generating function
of an individual $X_i$, and we write $G = G_1$ from now on. In principle, (6.15)
tells us everything about $Z_n$. For instance, we can now prove the following:

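Theorem 6.5.2. If $E(X_i) = \mu$ and $\mathrm{var}(X_i) = \sigma^2$, then

\[ E(Z_n) = \mu^n, \]

and

\[ \mathrm{var}(Z_n) = \begin{cases} n\sigma^2 & \text{if } \mu = 1, \\ \sigma^2 (\mu^n - 1) \mu^{n-1} (\mu - 1)^{-1} & \text{if } \mu \ne 1. \end{cases} \]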

Proof. Differentiate $G_n(s) = G(G_{n-1}(s))$ at $s = 1$, and use Theorem
2.6.7(a) to find

\[ E(Z_n) = \mu E(Z_{n-1}). \]

Now iterate this formula to obtain the first result in the current theorem.
Differentiate the same formula twice to obtain

\[ G_n''(1) = G''(1) (G_{n-1}'(1))^2 + G'(1) G_{n-1}''(1), \]

and use Theorem 2.6.7(b) to obtain the second result. □



Exercise 6.5.3. Provide all details of this last proof.
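For a concrete offspring distribution with finitely many values, the distribution of $Z_n$ can be computed exactly by composing generating functions as polynomials, which gives a way to test the moment formulas $E(Z_n) = \mu^n$ and $\mathrm{var}(Z_n) = \sigma^2(\mu^n - 1)\mu^{n-1}/(\mu - 1)$ for $\mu \ne 1$. The sketch below uses exact rational arithmetic; the offspring probabilities $(\frac{1}{4}, \frac{1}{4}, \frac{1}{2})$ and all function names are illustrative assumptions.

```python
from fractions import Fraction

def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    result = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            result[i + j] += x * y
    return result

def compose(outer, inner):
    """Coefficients of outer(inner(s)), computed with Horner's scheme."""
    result = [Fraction(0)]
    for coefficient in reversed(outer):
        result = poly_mul(result, inner)
        result[0] += coefficient
    return result

def generation_pmf(offspring_pmf, n):
    """Probabilities P(Z_n = k), read off from G_n = G_{n-1}(G(s)) with G_0(s) = s."""
    pmf = [Fraction(0), Fraction(1)]          # Z_0 = 1
    for _ in range(n):
        pmf = compose(pmf, offspring_pmf)
    return pmf

def mean_and_variance(pmf):
    mean = sum(k * p for k, p in enumerate(pmf))
    second_moment = sum(k * k * p for k, p in enumerate(pmf))
    return mean, second_moment - mean ** 2

offspring = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]   # assumed example law
mu, sigma2 = mean_and_variance(offspring)
n = 3
mean_n, var_n = mean_and_variance(generation_pmf(offspring, n))
print(mean_n == mu ** n)
print(var_n == sigma2 * (mu ** n - 1) * mu ** (n - 1) / (mu - 1))
```

The first entry of `generation_pmf(offspring, n)` is $P(Z_n = 0)$, the probability of extinction by generation $n$.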

Hence, the expected number of members in a branching process grows or
decays exponentially fast. If the expected number of children is larger than 1,
the expectation grows to infinity; if it is smaller than 1, it decays to zero.
This fact should make us curious as to whether it is possible that the branching
process survives forever. To this end we first prove:

Theorem 6.5.4. The probability $\eta$ that $Z_n = 0$ for some $n$ is equal to the
smallest non-negative root of the equation $G(s) = s$.

Example 6.5.5. Here is an example of Theorem 6.5.4 in action. Consider a
branching process with offspring distribution given by
$P(X = 0) = P(X = 1) = \frac{1}{4}$ and $P(X = 2) = \frac{1}{2}$. The generating
function $G$ is now given by

\[ G(s) = \frac{1}{4} + \frac{1}{4} s + \frac{1}{2} s^2. \]

Solving $G(s) = s$ gives $s = \frac{1}{2}$ and $s = 1$. The smallest non-negative
solution is $s = \frac{1}{2}$, and therefore this process survives forever with
probability $\frac{1}{2}$. □

Proof of Theorem 6.5.4. The probability $\eta$ of ultimate extinction can be
approximated by $\eta_n = P(Z_n = 0)$. Indeed, it is not hard to see that
$\eta_n \to \eta$ as $n \to \infty$. We now write

\[ \eta_n = P(Z_n = 0) = G_n(0) = G(G_{n-1}(0)) = G(\eta_{n-1}). \]

Now let $n \to \infty$ and use the fact that $G$ is continuous to obtain

\[ \eta = G(\eta). \]

This tells us that $\eta$ is indeed a root of $G(s) = s$, but the claim is that
it is the smallest non-negative root. To verify this, suppose that $\epsilon$ is
any non-negative root of the equation $G(s) = s$. Since $G$ is non-decreasing on
$[0, 1]$ we have

\[ \eta_1 = G(0) \le G(\epsilon) = \epsilon, \]

and

\[ \eta_2 = G(\eta_1) \le G(\epsilon) = \epsilon, \]

and so on, giving that $\eta_n \le \epsilon$ for all $n$ and hence
$\eta \le \epsilon$.

□
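The recursion $\eta_n = G(\eta_{n-1})$ from the proof doubles as a simple numerical method for the extinction probability. The following sketch applies it to the generating function $G(s) = \frac{1}{4} + \frac{1}{4}s + \frac{1}{2}s^2$, which is how Example 6.5.5 is read here; the function names and the number of iterations are arbitrary illustrative choices.

```python
def extinction_probability(pgf, iterations=200):
    """Iterate eta_n = G(eta_{n-1}) starting from eta_0 = 0, so eta_n = P(Z_n = 0)."""
    eta = 0.0
    for _ in range(iterations):
        eta = pgf(eta)
    return eta

def example_pgf(s):
    # G(s) = 1/4 + s/4 + s^2/2 (assumed offspring law: P(0) = P(1) = 1/4, P(2) = 1/2)
    return 0.25 + 0.25 * s + 0.5 * s * s

print(extinction_probability(example_pgf))
```

Because the iteration starts at 0 and $G$ is non-decreasing, the iterates increase towards the smallest non-negative root, in line with the argument above. Convergence can be slow when the mean offspring number is close to 1.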



The next result tells us that survival of a branching process is only possible
when $\mu > 1$, ignoring the trivial case where $P(X = 1) = 1$, in which case the
process trivially survives with probability 1.

Theorem 6.5.6. When $\mu \le 1$, the branching process dies out with
probability 1, except for the trivial case where $P(X = 1) = 1$. When $\mu > 1$,
the process survives forever with positive probability.

Proof. According to Theorem 6.5.4, we need to look at the smallest non-negative
root of the equation $G(s) = s$.

Suppose first that $\mu > 1$. Since

\[ G'(1) = \mu, \]

we have that $G'(1) > 1$. Since $G(1) = 1$, this means that there is some
$s' < 1$ for which $G(s') < s'$. Since $G(0) \ge 0$ and since $G$ is continuous,
there must be some point $s''$ between 0 and $s'$ with $G(s'') = s''$, which
implies that the smallest non-negative solution of $G(s) = s$ is strictly
smaller than 1. Hence the process survives forever with positive probability.

Next, consider the case in which $\mu \le 1$. Note that

\[ G'(s) = \sum_{n=1}^{\infty} n s^{n-1} P(X = n) > 0, \]

and

\[ G''(s) = \sum_{n=2}^{\infty} n(n-1) s^{n-2} P(X = n) > 0, \]

where the strict inequalities come from the fact that we have excluded the case
$P(X = 1) = 1$. This implies that $G$ is strictly increasing and strictly convex.
Hence if $G'(1) \le 1$, then $G'(s) < 1$ for all $s \in [0, 1)$ and then it is
easy to see that $G(s) > s$ for all $s < 1$, and therefore the smallest
non-negative solution of $G(s) = s$ is $s = 1$, proving the result. □



6.6 Exercises 

Exercise 6.6.1. From $[0, 1]$ remove the middle third $(\frac{1}{3}, \frac{2}{3})$.
From the remainder, a union of two intervals, remove the two open middle thirds
$(\frac{1}{9}, \frac{2}{9})$ and $(\frac{7}{9}, \frac{8}{9})$. Continue in the
obvious way; what remains when this process is repeated infinitely often is
called the Cantor set.

(a) Show that the Cantor set consists of exactly those points whose trinary ex- 
pansion (that is, to the base 3) contains no Is. 

(b) Show that the Cantor set is small in the sense of Definition 6.1.4. 

Exercise 6.6.2. Show that $\int_0^1 s_n(\omega) \, d\omega = 0$.

Exercise 6.6.3. Show that $\int_0^1 s_n^2(\omega) \, d\omega = n$.

Exercise 6.6.4. Show that any countable subset of [0, 1] is small in the sense of 
Definition 6.1.4. 

Exercise 6.6.5. Consider a branching process with a geometric offspring
distribution $P(X = k) = (1 - p)p^k$, for $k = 0, 1, 2, \ldots$ Show that
ultimate extinction is certain if $p \le \frac{1}{2}$ and that the probability
of extinction is $(1 - p)/p$ if $p > \frac{1}{2}$.

Exercise 6.6.6. Show that for a branching process $(Z_n)$ with expected
offspring $\mu$, we have

\[ E(Z_m Z_n) = \mu^{n-m} E(Z_m^2) \]

for $m \le n$.




Chapter 7 

The Poisson Process 



In this chapter we discuss a probabilistic model which can be used to describe the 
occurrences of unpredictable events, which do exhibit a certain amount of statisti- 
cal regularity. Examples to keep in mind are the moments at which telephone calls 
are received in a call centre, the moments at which customers enter a particular 
shop, or the moments at which California is hit by an earthquake. We refer to an 
occurrence of such an unpredictable event simply as an occurrence. 

7.1 Building a Model 

When we try to model the occurrences of the various processes described above, 
then there are a number of characteristics that we may want to build in. To name 
a few: 

1. There is a certain amount of regularity in the processes described above. 
Although individual earthquakes are impossible to predict and certainly do 
not occur in a strictly regular pattern, there is perhaps some statistical 
regularity in the sense that when we observe earthquakes during 10 years, 
say, without knowing the absolute time frame, then we have no way to decide 
whether we observe the time period 1910-1920 or 1990-2000. In probabilistic 
terms, the process is stationary in time. In other words, the course of time 
should not change the probabilistic properties of the process. (Of course, in 
the case of a shop this can only be realistic as long as the shop is open, 
and even then one can ask whether there will typically be more customers 
around closing time than around 3 p.m., say. More about this in Exercise
7.5.8.)

2. The fact that there is an occurrence at a particular time says nothing about
the probability of an occurrence at, or around, a later or earlier time. In
other words, there seems to be some kind of independence with respect to
various occurrences.

3. The next occurrence cannot be predicted from current and past information.
In other words, the process of occurrences seems to have no memory. The
fact that something happened in the past has no effect on the probabilities
for future occurrences.

4. There is no accumulation of occurrences at any time. In other words, in each 
finite time interval, there are only finitely many occurrences. 

When we try to build a mathematical model for this kind of process, we should 
keep these characteristics in mind. In fact, the characteristics pave the way to the 
appropriate model, as we will demonstrate now. It is quite important that you see 
how this works. Doing computations in a given model is one thing, but to make 
an appropriate model is obviously of the highest importance. If the model is not 
appropriate, then any computation in it has very little, if any, value. 

So how can we build a model in the context of the characteristics described 
above? There are, in fact, a number of ways and we think that it is very instructive 
to travel them all. 

Approach 1. (via waiting times) Perhaps the most natural thing to do is to con- 
centrate on the waiting times. Point (1) above suggests that the waiting time 
distribution should be the same at all times: the waiting time between the 6th and 
7th event should have the same distribution as the waiting time between the 12th 
and 13th. 

Point (3) above suggests that the process should have no memory. We have 
come across a continuous distribution with a certain lack of memory property. 
Indeed, in Exercise 5.12.13 we showed that the exponential distribution has no 
memory in the sense that when X has such an exponential distribution, then 

\[ P(X > s + t \mid X > t) = P(X > s). \]

If we think of X as the waiting time between successive occurrences, then this 
formula expresses the idea that the fact that we have waited already t time units 
does not change the probability that we have to wait another s time units. This 
property makes the exponential distribution a serious candidate for the waiting 
times between successive occurrences. 
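For reference, the lack of memory property follows in one line from the tail probability of the exponential distribution. Writing $\lambda > 0$ for its parameter (the symbol is introduced here only for this computation), we have $P(X > x) = e^{-\lambda x}$ for $x \ge 0$, and hence, for $s, t \ge 0$:

```latex
\[
P(X > s + t \mid X > t)
  = \frac{P(X > s + t)}{P(X > t)}
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda t}}
  = e^{-\lambda s}
  = P(X > s).
\]
```

The first equality uses that the event $\{X > s + t\}$ is contained in $\{X > t\}$.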

The candidacy of the exponential distribution becomes even better motivated 
when we look back at Example 5.1.6. In this example we showed that the expo- 
nential distribution is a very natural candidate to model the waiting time for the 
next occurrence. 

Hence we might be inclined to define the following model. We consider
independent random variables $X_1, X_2, \ldots$, which are exponentially
distributed with parameter $\lambda$, that is, they have density $f$ given by

\[ f(x) = \lambda e^{-\lambda x}, \]

for $x > 0$, and $f(x) = 0$ for $x \le 0$. The first occurrence is at time
$X_1$, the second at time $X_1 + X_2$, etcetera. In general, we can define

\[ S_n = \sum_{i=1}^{n} X_i, \]

and then $S_n$ should be thought of as the moment of the $n$th occurrence.

In principle, this defines the model completely. It is possible to do
computations now, like computing the probability that there is no occurrence
between time 5 and time 10; we come back to this. For now, it is perhaps good to
look at the parameter $\lambda$. Since $E(X_i) = \lambda^{-1}$, we see that a
high $\lambda$ corresponds to a small average waiting time. Therefore, it makes
sense to call $\lambda$ the intensity of the process.

Approach 2. (via the number of occurrences) Another approach becomes appar- 
ent when we look back at Example 1.5.11. In that example, we undertook the 
enterprise to model the arrival of customers in a shop between time t = 0 and 
t = 1, not by focussing on the waiting times, but by concentrating on the number 
of customers. The waiting time should be a continuous random variable, but the 
number of customers is obviously discrete. Using a discrete approximation with 
the binomial distribution, we showed that it is reasonable to assume that the 
probability of having k customers in the shop between time 0 and time 1 is equal 
to 

\[ P(N = k) = e^{-\lambda} \frac{\lambda^k}{k!}, \]

for $k = 0, 1, \ldots$, which we recognize as the probability mass function of a
Poisson distribution with parameter $\lambda$.

How can we extend this to longer time intervals? Well, using the idea of
independence, mentioned in point (2) above, we might say that the numbers of
customers in two disjoint time intervals should be independent. So the number
of customers between $t = 1$ and $t = 2$ should be independent of the number
of customers between $t = 0$ and $t = 1$, and have the same distribution. As a
result, the total number of customers in the time interval between $t = 0$ and
$t = 2$ can be seen as the sum of two independent Poisson distributed random
variables with parameter $\lambda$, and we know from Example 2.4.13 that this
yields a Poisson distribution with parameter $2\lambda$. This leads to the idea
that the number of customers in a time interval of length $L$ should have a
Poisson distribution with parameter $\lambda L$.

This approach does not tell us immediately when the customers arrive; it
only tells us that in a given time interval the number of customers should have a
Poisson distribution, with a parameter which is proportional to the length of the
interval. Since we want our process to be stationary, it is reasonable to
distribute all customers over the interval in a completely arbitrary way. The
model then amounts to the following two-step procedure for defining a process on
any time interval of length $L$:

1. Draw a random number $N$ from a Poisson distribution with parameter
$\lambda L$. This number represents the total number of occurrences in the time
interval of length $L$ under consideration.

2. Each of the $N$ customers is given a position, which we choose independently
and uniformly distributed over the time interval.

It is not a priori clear that this approach leads to the same model as in
Approach 1. We shall see shortly that they are, in fact, equivalent. At this
point, we only note the following connection: the probability that there is no
occurrence before time $t$ is, according to the current approach, equal to
$e^{-\lambda t}$. Now note that this is equivalent to saying that the waiting
time for the first occurrence has an exponential distribution with parameter
$\lambda$, in full agreement with Approach 1.
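The two-step procedure of Approach 2 translates almost directly into code. The sketch below draws the Poisson number by inverting its distribution function (one of several possible samplers, chosen here to avoid external libraries) and then checks by simulation that the probability of no occurrence before time $t$ is close to $e^{-\lambda t}$. All names, the seed and the parameter values are illustrative assumptions.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw from a Poisson distribution by inverting its distribution function."""
    u = rng.random()
    k = 0
    term = math.exp(-mean)                   # P(N = 0)
    cumulative = term
    while u > cumulative and k < 1000:       # cap guards against rounding in the far tail
        k += 1
        term *= mean / k                     # P(N = k) obtained from P(N = k - 1)
        cumulative += term
    return k

def occurrences_on_interval(intensity, length, rng):
    """Step 1: N ~ Poisson(intensity * length). Step 2: N independent uniform positions."""
    count = sample_poisson(intensity * length, rng)
    return sorted(rng.uniform(0.0, length) for _ in range(count))

rng = random.Random(1)
intensity, length, t = 2.0, 3.0, 0.5
runs = 20_000
empty = sum(
    1 for _ in range(runs)
    if all(x > t for x in occurrences_on_interval(intensity, length, rng))
)
print(empty / runs, math.exp(-intensity * t))
```

The two printed numbers should be close; the agreement reflects the identity $\sum_n P(N = n)(1 - t/L)^n = e^{-\lambda t}$ for $N$ Poisson with parameter $\lambda L$.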

Approach 3. (via differential equations) Still, there are other ways to approach the 
problem, without reference to earlier exercises or examples. 

We can ask about the probability that at time $h$ (think of $h$ as very small)
exactly one occurrence has taken place, or about the probability that by time
$h$, no occurrence, or more than one occurrence has taken place. A natural way
to do this proceeds as follows.

Let $p_i(h)$ be the probability that at time $h$, exactly $i$ occurrences have
taken place. Let us assume that $p_1(h)$ is differentiable in $h = 0$, with
derivative $\lambda > 0$, that is,

\[ p_1'(0) = \lambda. \tag{7.1} \]

The requirement that we do not want accumulation of occurrences can be
formulated in terms of $p_{\ge 2}(h) = \sum_{i \ge 2} p_i(h)$ by the requirement
that

\[ p_{\ge 2}'(0) = 0. \tag{7.2} \]

Since

\[ p_0(h) + p_1(h) + p_{\ge 2}(h) = 1, \]

we see, using (7.1) and (7.2), that $p_0(h)$ is differentiable in $h = 0$ with
derivative $-\lambda$, that is

\[ p_0'(0) = -\lambda. \tag{7.3} \]

Assuming (7.1), (7.2) and (7.3) we now argue as follows.

\begin{align*}
p_0(t + h) &= P(\text{no occurrence before } t + h) \\
&= P(\text{no occurrence before } t) \, P(\text{no occurrence between } t \text{ and } t + h),
\end{align*}

since we want the number of occurrences in disjoint time intervals to be
independent. By stationarity, the probability of having no occurrences between
times $t$ and $t + h$ should be the same as $p_0(h)$, and the probability of
having no occurrences before time $t$ is by definition equal to $p_0(t)$. Hence
we find

\[ p_0(t + h) = p_0(t) p_0(h), \]

and a little algebra shows that this leads to

\[ \frac{p_0(t + h) - p_0(t)}{h} = p_0(t) \frac{p_0(h) - 1}{h} = p_0(t) \frac{p_0(h) - p_0(0)}{h}. \]

Taking the limit for $h \to 0$ and using (7.3) now leads to

\[ p_0'(t) = -\lambda p_0(t), \]

and this differential equation is easily solved (using the boundary condition
$p_0(0) = 1$), giving

\[ p_0(t) = e^{-\lambda t}. \]

Now this is quite interesting, since this expression tells us that the
probability that there is no occurrence before time $t$ is equal to
$e^{-\lambda t}$. But this means that the waiting time for the first occurrence
has an exponential distribution with parameter $\lambda$, again in full
agreement with Approach 1 and Approach 2.
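The step from the functional equation to the exponential can be mimicked numerically: over a short interval of length $h$ the probability of no occurrence is approximately $1 - \lambda h$, and multiplying these factors over many short intervals approaches $e^{-\lambda t}$. The sketch below is a plain Euler scheme under that approximation; the function name and step count are illustrative.

```python
import math

def euler_no_occurrence(intensity, t, steps=10_000):
    """Euler scheme for p0'(t) = -intensity * p0(t) with p0(0) = 1."""
    h = t / steps
    p0 = 1.0
    for _ in range(steps):
        # p0(t + h) = p0(t) * p0(h), with p0(h) approximated by 1 - intensity * h
        p0 *= 1.0 - intensity * h
    return p0

print(euler_no_occurrence(2.0, 1.0), math.exp(-2.0))
```

The discrepancy between the two printed values shrinks as `steps` grows, which is the numerical counterpart of taking the limit $h \to 0$.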



We see that so far, the various approaches seem to be compatible. They 
seem to direct us towards a model in which waiting times are continuous with 
an exponential distribution, and where the total number of occurrences in a given 
time interval can be described with a Poisson distribution. Is there a first choice? Is 
one approach better than the other? Hardly, at least at this point (but see Exercise 
7.5.9). When we choose a particular approach (for instance Approach 1) then the 
fact that waiting times are exponentially distributed is true by definition, whereas 
the fact that the total number of occurrences by time t has a Poisson distribution, 
has to be proved. If we choose Approach 2, then this last fact becomes true by 
definition, and we have to prove the statements about the waiting times. Hence 
the choice of the approach is to a large extent arbitrary. 

In this chapter, we choose the first approach, perhaps because this is the 
easiest conceptually. Hence, we make the following definition. 



Definition 7.1.1. Let $X_1, X_2, \ldots$ be independent and identically
distributed random variables with an exponential distribution with parameter
$\lambda > 0$. Let $S_0 = 0$, and for $n = 1, 2, \ldots$, let

\[ S_n = \sum_{i=1}^{n} X_i. \]

We call $X_i$ the $i$th inter-arrival time, and $S_n$ the moment of the $n$th
occurrence. We define, for all $t \ge 0$, $N(t)$ as the number of occurrences up
to and including time $t$, that is, we define

1. $N(0) = 0$,

2. For $t > 0$, $N(t) = \max\{n : S_n \le t\}$.

We call $N(t)$ a Poisson process with intensity $\lambda$.

The intensity $\lambda$ of a Poisson process can be taken to be any positive
number. The choice very much depends on the application that we have in mind.
For instance, when we think about earthquakes in California and take the unit of
time as a year, then $\lambda$ should not be so high, since the waiting times
between successive earthquakes can be quite long. On the other hand, when we
look at the moments that a radioactive material emits particles, with seconds as
our time unit, then the intensity should be very high.
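Definition 7.1.1 can be simulated directly by adding up exponential inter-arrival times until time $t$ is passed. The sketch below does this and compares the empirical distribution of $N(t)$ with the Poisson probabilities $e^{-\lambda t}(\lambda t)^k / k!$ suggested by Approach 2. The seed, parameter values and function name are illustrative; `random.expovariate` takes the rate $\lambda$ as its argument.

```python
import math
import random

def count_occurrences(intensity, t, rng):
    """N(t) = max{n : S_n <= t}, with S_n a sum of exponential inter-arrival times."""
    n = 0
    arrival = rng.expovariate(intensity)        # S_1 = X_1
    while arrival <= t:
        n += 1
        arrival += rng.expovariate(intensity)   # S_{n+1} = S_n + X_{n+1}
    return n

rng = random.Random(2)
intensity, t, runs = 1.5, 4.0, 20_000
counts = [count_occurrences(intensity, t, rng) for _ in range(runs)]
sample_mean = sum(counts) / runs
print(sample_mean, intensity * t)
for k in range(4):
    empirical = counts.count(k) / runs
    poisson = math.exp(-intensity * t) * (intensity * t) ** k / math.factorial(k)
    print(k, round(empirical, 4), round(poisson, 4))
```

A simulation of this kind only suggests the agreement between the two approaches; the proof is the subject of the next section.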



7.2 Basic Properties 



In this section we prove a number of basic facts which should increase our un- 
derstanding of the Poisson process, and which also justify the definition, given 
our objectives as mentioned at the beginning of the previous section. First of all, 
we should note that we have computed the distribution of $S_n$ in Example 5.7.1.
Indeed, the distribution of $S_n$, being the sum of $n$ independent exponentially
distributed random variables with parameter $\lambda$, is a gamma distribution
and its density is given by

\[ f_{S_n}(x) = \frac{\lambda^n}{(n-1)!} x^{n-1} e^{-\lambda x}, \]

for $x \ge 0$, and $f_{S_n}(x) = 0$ for $x < 0$. We can use this to prove the
following fact, which provides a link between Approach 1 and Approach 2 above.



Proposition 7.2.1. We have

\[ P(N(t) = n) = e^{-\lambda t} \frac{(\lambda t)^n}{n!}. \]

Proof. The key to the proof is the observation that the event $N(t) \ge n$ is
equivalent to $S_n \le t$. Hence we can compute the distribution of $N(t)$ via
the distribution of $S_n$, as follows:

\begin{align*}
P(N(t) = n) &= P(N(t) \ge n) - P(N(t) \ge n + 1) \\
&= P(S_n \le t) - P(S_{n+1} \le t) \\
&= \int_0^t \lambda e^{-\lambda x} \frac{(\lambda x)^{n-1}}{(n-1)!} \, dx - \int_0^t \lambda e^{-\lambda x} \frac{(\lambda x)^n}{n!} \, dx.
\end{align*}

With partial integration, the first integral can be written as

\[ \int_0^t \lambda e^{-\lambda x} \frac{(\lambda x)^{n-1}}{(n-1)!} \, dx = e^{-\lambda t} \frac{(\lambda t)^n}{n!} + \int_0^t \lambda e^{-\lambda x} \frac{(\lambda x)^n}{n!} \, dx, \]

which completes the proof. □

The previous result tells us that the number of occurrences in the time
interval $(0, t)$ has a Poisson distribution with parameter $\lambda t$. The
obvious question now is about the distribution of the number of occurrences in
an arbitrary interval $(s, s + t)$. To study this, we need the following
technical result:

Lemma 7.2.2. For all $k \in \mathbb{N}$, $t > 0$, $u \in (0, t]$ and $v \ge 0$
we have

\[ P(t - u < S_k \le t, \; t < S_{k+1} \le t + v) = \frac{(\lambda t)^k - (\lambda (t - u))^k}{k!} e^{-\lambda t} (1 - e^{-\lambda v}). \]

Proof. Denote the joint density of $S_k$ and $X_{k+1}$ by $f$. By independence,
we have

\[ f(x, y) = f_{S_k}(x) f_{X_{k+1}}(y). \]

We can now write

\begin{align*}
P(t - u < S_k \le t, \; t < S_{k+1} \le t + v) &= \int_{t-u}^{t} \int_{t-x}^{t+v-x} f(x, y) \, dy \, dx \\
&= \int_{t-u}^{t} f_{S_k}(x) \left( \int_{t-x}^{t+v-x} f_{X_{k+1}}(y) \, dy \right) dx \\
&= \frac{(\lambda t)^k - (\lambda (t - u))^k}{k!} e^{-\lambda t} (1 - e^{-\lambda v}). \qquad \Box
\end{align*}

Next, we fix some time $t$ and study the last occurrence before $t$ and the
first occurrence after $t$. When $N(t) = k$, the last occurrence before $t$ is
$S_k$ and the first occurrence after $t$ is $S_{k+1}$. Hence, the last
occurrence before time $t$ can be written as $S_{N(t)}$. If there is no
occurrence before time $t$, then $S_{N(t)} = S_0 = 0$ by convention. Similarly,
$S_{N(t)+1}$ represents the first occurrence after time $t$.

Exercise 7.2.3. Make sure you really understand this.

The time between $t$ and $S_{N(t)+1}$ is not an ordinary waiting time; it is the
remainder of the current waiting time at time $t$. You might expect that this
remaining waiting time is in some sense 'smaller' than an 'ordinary' waiting
time. However, the following surprising and fundamental result tells us that
this is not the case. It tells us that the remaining waiting time at time $t$,
and all subsequent waiting times are independent and have the same exponential
distribution.

Theorem 7.2.4. Let $V_t := S_{N(t)+1} - t$, and let for all $i = 2, 3, \ldots$,
$Z_i := X_{N(t)+i}$. Then for all $m$, the random variables
$V_t, Z_2, \ldots, Z_m$ are independent and identically distributed, with an
exponential distribution with parameter $\lambda$.

Proof. We express the distribution of $V_t$ and the $Z_i$'s in terms of the
$S_n$'s and $X_n$'s as follows (using Lemma 7.2.2 with $u = t$ and $v = z_1$,
and the independence of the $X_n$'s):

\begin{align*}
P(V_t \le z_1, Z_2 \le z_2, \ldots, Z_m \le z_m)
&= \sum_{k=0}^{\infty} P(N(t) = k, V_t \le z_1, Z_2 \le z_2, \ldots, Z_m \le z_m) \\
&= \sum_{k=0}^{\infty} P(N(t) = k, S_{k+1} - t \le z_1, X_{k+2} \le z_2, \ldots, X_{k+m} \le z_m) \\
&= \sum_{k=0}^{\infty} P(S_k \le t < S_{k+1} \le t + z_1, X_{k+2} \le z_2, \ldots, X_{k+m} \le z_m) \\
&= \sum_{k=0}^{\infty} e^{-\lambda t} \frac{(\lambda t)^k}{k!} (1 - e^{-\lambda z_1}) \prod_{i=2}^{m} (1 - e^{-\lambda z_i}) \\
&= \prod_{i=1}^{m} (1 - e^{-\lambda z_i}) \sum_{k=0}^{\infty} P(N(t) = k) \\
&= \prod_{i=1}^{m} (1 - e^{-\lambda z_i}),
\end{align*}

proving the result. □

The computations in the proof of Theorem 7.2.4 even show a little bit more: 

Theorem 7.2.5. For any $m$, the random variables $N(t), V_t, Z_2, \ldots, Z_m$ are 
independent. 

Exercise 7.2.6. Show how Theorem 7.2.5 follows from the proof of Theorem 
7.2.4. 



We are now ready to consider $N(t + s) - N(t)$, which is the number of 
occurrences in an arbitrary interval $(t, t + s)$. The following theorem shows that 
this number also has a Poisson distribution, and that the numbers of occurrences 
in $(0, t)$ and $(t, t + s)$ are independent. 

Theorem 7.2.7. For all $t > 0$, $N(t + s) - N(t)$ has a Poisson distribution with 
parameter $\lambda s$. Furthermore, $N(t)$ and $N(t + s) - N(t)$ are independent. 

Proof. Let $V_t, Z_2, Z_3, \ldots$ be defined as above, and write $Z_1 = V_t$. We observe that 
$$N(t + s) - N(t) = \max\{i : Z_1 + Z_2 + \cdots + Z_i \le s\}.$$
Hence the distribution of $N(t + s) - N(t)$ is fully determined by the joint 
distribution of $V_t$ and the $Z_i$'s. Since these are independent and exponentially 
distributed with parameter $\lambda$, just like the $X_i$'s that determine $N(s)$, 
the first claim follows immediately. 

For the second claim, note that $N(t + s) - N(t)$ depends only on $V_t, Z_2, \ldots$, 
and by Theorem 7.2.5, these random variables are independent of $N(t)$ and the 
result follows. □ 
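
Theorem 7.2.7 can be checked empirically. The following is a minimal simulation sketch, not part of the original development: the helper names `arrival_times` and `increment_stats` and the parameter values are our own choices. It builds the process from exponential waiting times and looks at the sample mean and variance of $N(t+s) - N(t)$ (both should be close to $\lambda s$ for a Poisson distribution) and at its sample covariance with $N(t)$ (which should be close to 0 under independence).

```python
import random

def arrival_times(lam, horizon, rng):
    """Occurrence times S_1 < S_2 < ... up to `horizon`, built from Exp(lam) waiting times."""
    times, s = [], rng.expovariate(lam)
    while s <= horizon:
        times.append(s)
        s += rng.expovariate(lam)
    return times

def increment_stats(lam, t, s, runs, seed=0):
    """Sample mean and variance of N(t+s)-N(t), and its sample covariance with N(t)."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(runs):
        times = arrival_times(lam, t + s, rng)
        n_t = sum(1 for x in times if x <= t)   # N(t)
        first.append(n_t)
        second.append(len(times) - n_t)         # N(t+s) - N(t)
    mean_1 = sum(first) / runs
    mean_2 = sum(second) / runs
    var_2 = sum((b - mean_2) ** 2 for b in second) / runs
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second)) / runs
    return mean_2, var_2, cov

# Illustrative parameters: lam * s = 3.
mean_inc, var_inc, cov = increment_stats(lam=2.0, t=3.0, s=1.5, runs=20000)
print(mean_inc, var_inc, cov)
```

A vanishing covariance is of course weaker than independence; the simulation only illustrates the theorem and does not replace the proof.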

The independence result in the last theorem can be generalised considerably, 
as follows. 

Theorem 7.2.8. For any $0 < t_1 < t_2 < \cdots < t_n$, the random variables $N(t_1)$, 
$N(t_2) - N(t_1)$, $N(t_3) - N(t_2)$, $\ldots$, $N(t_n) - N(t_{n-1})$ are independent. 

Proof. The proof proceeds by induction. For $n = 2$, the result is true according to 
Theorem 7.2.7. Now suppose that the result is true for $n = m$, and consider times 
$0 < t_1 < \cdots < t_{m+1}$. Let $V_t, Z_2, Z_3, \ldots$ be as defined before with $t_1$ instead of $t$, 
and let $T_i = V_t + Z_2 + \cdots + Z_i$, for all $i$. We use the notation $s_1 = t_2 - t_1$, 
$s_2 = t_3 - t_1$, $\ldots$, $s_m = t_{m+1} - t_1$. Then we can write 
$$\begin{aligned}
&P(N(t_1) = k_1, N(t_2) - N(t_1) = k_2, \ldots, N(t_{m+1}) - N(t_m) = k_{m+1}) \\
&\quad= P(N(t_1) = k_1,\ T_{k_2} \le t_2 - t_1 < T_{k_2+1}, \ldots, \\
&\qquad\quad T_{k_2+\cdots+k_{m+1}} \le t_{m+1} - t_1 < T_{k_2+\cdots+k_{m+1}+1}) \\
&\quad= P(N(t_1) = k_1)\, P(T_{k_2} \le t_2 - t_1 < T_{k_2+1}, \ldots, \\
&\qquad\quad T_{k_2+\cdots+k_{m+1}} \le t_{m+1} - t_1 < T_{k_2+\cdots+k_{m+1}+1}) \\
&\quad= P(N(t_1) = k_1)\, P(S_{k_2} \le t_2 - t_1 < S_{k_2+1}, \ldots, \\
&\qquad\quad S_{k_2+\cdots+k_{m+1}} \le t_{m+1} - t_1 < S_{k_2+\cdots+k_{m+1}+1}) \\
&\quad= P(N(t_1) = k_1)\, P(N(s_1) = k_2, N(s_2) - N(s_1) = k_3, \ldots, N(s_m) - N(s_{m-1}) = k_{m+1}) \\
&\quad= P(N(t_1) = k_1)\, P(N(s_1) = k_2) \cdots P(N(s_m) - N(s_{m-1}) = k_{m+1}) \\
&\quad= P(N(t_1) = k_1)\, P(N(t_2) - N(t_1) = k_2) \cdots P(N(t_{m+1}) - N(t_m) = k_{m+1}),
\end{aligned}$$
proving the result. □ 

Exercise 7.2.9. Justify all the equalities in the last proof. Where do we use the 
induction hypothesis? 

At this point, it is perhaps good to look back at the points mentioned at the 
beginning of Section 7.1. We mentioned four characteristics of the type of process 
that we want to model, namely stationarity, independence, lack of memory and 
no accumulation of occurrences. At this point we have addressed all these four 
issues. Indeed, stationarity follows from the fact that the distribution of 
$N(t + s) - N(t)$ does not depend on $t$, only on $s$, the length of the interval. Independence 
is addressed in Theorem 7.2.8, and lack of memory follows from Theorem 7.2.4. 
Finally, the fact that there is no accumulation of occurrences already follows from 
Proposition 7.2.1. Indeed, since the number of occurrences in a bounded interval 
has a Poisson distribution, the probability of having infinitely many points in a 
given interval is equal to 0. 

The Poisson process defined and studied so far seems to be a very reasonable 
model for the type of process we have in mind. It is also a very interesting and 
subtle construction from a purely mathematical point of view, showing a nice 
interplay between discrete and continuous distributions. In the next sections, we 
shall explore some more of its properties. 



7.3 The Waiting Time Paradox 

The waiting times between successive occurrences have exponential distributions 
by construction. Nevertheless, there is something remarkable going on, as we can 
see when we take a close look at Theorem 7.2.4. 

Fix a time $t$, and look at the waiting time between the last occurrence before 
$t$, and the first occurrence after $t$, that is, at 
$$X_{N(t)+1} = S_{N(t)+1} - S_{N(t)}.$$
At first sight, this does not seem to be a very interesting question, since all $X_i$'s 
have an exponential distribution. However, the point is that the index $N(t) + 1$ is 
random, and a little more thought reveals that random indices change the scene 
dramatically. To see that a random index can really make a big difference, consider 
the following example. 

Define the random variable $M$ as the first index $i$ for which $X_i > 10$. It is 
clear that $X_M$ does not have an exponential distribution since $P(X_M \le 10) = 0$. 

Hence, when we have random indices, the distribution might change. A look 
at Theorem 7.2.4 now does indeed tell us that $X_{N(t)+1}$ does not have an 
exponential distribution with parameter $\lambda$. To see this, note that 
$$X_{N(t)+1} = (S_{N(t)+1} - t) + (t - S_{N(t)}) = V_t + (t - S_{N(t)}).$$
Since $V_t$ itself has an exponential distribution with parameter $\lambda$, $X_{N(t)+1}$ should 
in some sense be bigger than an exponential random variable, since we add $t - S_{N(t)}$ 
as an extra term. This fact is called the waiting time paradox. 

It must be stressed that the waiting time paradox is not an artefact of a 
Poisson process, but a much more general phenomenon. To see this, and to get 
a better understanding of the paradox, we consider a very different situation in 
which the waiting times are not exponential. 

Example 7.3.1. Consider random variables $X_1, X_2, \ldots$, independent and identically 
distributed with distribution 
$$P(X_i = x) = \begin{cases} 1/2 & \text{if } x = 1, \\ 1/2 & \text{if } x = 100, \end{cases}$$
and interpret $X_i$ as the waiting time between the $(i-1)$st and the $i$th occurrence. 
As before, we let $S_n = \sum_{i=1}^{n} X_i$ and $S_0 = 0$. Now let $t$ be some large number 
and consider the length of the waiting time between the last occurrence before $t$ 
and the first occurrence after $t$. What is the distribution of this quantity? Well, 
although individual waiting times are always equal to 100 with probability 1/2 
and equal to 1 with probability 1/2, it is clear that the time periods with waiting 
time 100 cover a much higher fraction of the time axis than the time periods with 
waiting time 1. Therefore, it is much more likely that a given $t$ is contained in a 
waiting time of length 100 than that it is contained in a waiting time of length 1. 
Hence, the expectation of $X_{N(t)+1}$ is certainly larger than $50\tfrac{1}{2}$. □ 

This example shows that the waiting time paradox is not restricted to Poisson 
processes, but is a general phenomenon, coming from the fact that the larger 
waiting times cover a larger fraction of the time axis than the smaller ones. 
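
The effect described in Example 7.3.1 is easy to see in a simulation. The sketch below is ours, with hypothetical helper names and arbitrary values of `t` and `runs`. It repeatedly generates waiting times equal to 1 or 100 and records the one that covers $t$. For comparison it also prints $E(X_1^2)/E(X_1)$, the standard large-$t$ value from renewal theory for the mean of the covering waiting time; that formula is not derived in this chapter and is quoted here as an outside fact.

```python
import random

def covering_waiting_time(t, rng):
    """Length of the waiting time containing t, for waiting times 1 or 100 (prob. 1/2 each)."""
    s = 0.0
    while True:
        x = rng.choice((1.0, 100.0))
        if s + x > t:        # the interval (s, s + x] covers t
            return x
        s += x

def mean_covering_waiting_time(t, runs, seed=0):
    rng = random.Random(seed)
    return sum(covering_waiting_time(t, rng) for _ in range(runs)) / runs

estimate = mean_covering_waiting_time(t=2_000.0, runs=5000)
ordinary_mean = 0.5 * 1 + 0.5 * 100                              # E(X_1) = 50.5
size_biased_mean = (0.5 * 1**2 + 0.5 * 100**2) / ordinary_mean   # E(X_1^2) / E(X_1)
print(estimate, ordinary_mean, size_biased_mean)
```

The estimate lies far above $50\tfrac{1}{2}$, because almost every $t$ falls inside a waiting time of length 100.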

In case of a Poisson process, we can do exact computations. We denote the 
time between $t$ and the last occurrence before $t$ by $U_t$, with the convention that if 
$N(t) = 0$, then $U_t = t$. The time between $t$ and the next occurrence was denoted 
earlier by $V_t$. We know already from Theorem 7.2.4 that $V_t$ has an exponential 
distribution with parameter $\lambda$. 

Theorem 7.3.2. The distribution of $U_t$ is given by 
$$P(U_t \le u) = \begin{cases} 1 - e^{-\lambda u} & \text{for } 0 \le u < t, \\ 1 & \text{for } u \ge t. \end{cases}$$
Moreover, $U_t$ and $V_t$ are independent. 

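Proof. First of all, we have that 
$$P(U_t = t, V_t \le v) = P(t < X_1 \le t + v) = e^{-\lambda t}(1 - e^{-\lambda v}). \qquad (7.4)$$
Furthermore, for $0 < u < t$ and $v > 0$ we find, using Lemma 7.2.2, that 
$$\begin{aligned}
P(U_t \le u, V_t \le v) &= P(S_{N(t)} \ge t - u,\ S_{N(t)+1} \le t + v) \\
&= \sum_{k=1}^{\infty} P(N(t) = k,\ S_k \ge t - u,\ S_{k+1} \le t + v) \\
&= e^{-\lambda t}(1 - e^{-\lambda v}) \sum_{k=1}^{\infty} \frac{(\lambda t)^k - (\lambda(t-u))^k}{k!} \\
&= e^{-\lambda t}\,(e^{\lambda t} - e^{\lambda(t-u)})(1 - e^{-\lambda v}) \\
&= (1 - e^{-\lambda u})(1 - e^{-\lambda v}).
\end{aligned}$$
Sending $v \to \infty$ in (7.4) and the last expression yields the distribution of $U_t$ 
as in the statement of the theorem. Independence also follows immediately from 
the given expression plus the fact that $V_t$ has an exponential distribution with 
parameter $\lambda$. □ 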



Note that $U_t$ is a very natural example of a random variable which is neither 
discrete nor continuous; see Section 5.9. We can now compute the expectation of 
$X_{N(t)+1}$, representing the time between the last occurrence before $t$ and the first 
occurrence after $t$. 

Proposition 7.3.3. It is the case that 
$$E(X_{N(t)+1}) = \frac{2 - e^{-\lambda t}}{\lambda},$$
and therefore, $E(X_{N(t)+1}) \to 2E(X_1) = \frac{2}{\lambda}$, as $t \to \infty$. 

Proof. We have that $E(V_t) = \lambda^{-1}$. Furthermore, 
$$E(U_t) = \int_0^t x \lambda e^{-\lambda x}\, dx + t e^{-\lambda t} = \frac{1 - e^{-\lambda t}}{\lambda}.$$
The result now follows immediately, since $X_{N(t)+1} = U_t + V_t$. □ 

All these results are quite intuitive. The distribution of $U_t$ is an exponential 
distribution which is truncated at $t$ because at time $t$, the last 
occurrence before time $t$ cannot be more than $t$ time units away. This truncation 
effect should disappear in the limit for $t \to \infty$, and indeed, the expectation of $U_t$ 
does converge to the expectation of the exponential distribution with parameter 
$\lambda$. 
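
The formula of Proposition 7.3.3 can be compared with a simulation. The following sketch uses our own helper names and illustrative values $\lambda = 1$, $t = 2$. It generates exponential waiting times until one of them covers $t$ and averages the length of that covering waiting time over many runs.

```python
import math
import random

def covering_waiting_time(lam, t, rng):
    """Return X_{N(t)+1}: the waiting time that is in progress at time t."""
    s = 0.0
    while True:
        x = rng.expovariate(lam)
        if s + x > t:        # S_{N(t)} = s <= t < s + x = S_{N(t)+1}
            return x
        s += x

def simulated_mean(lam, t, runs, seed=0):
    rng = random.Random(seed)
    return sum(covering_waiting_time(lam, t, rng) for _ in range(runs)) / runs

lam, t = 1.0, 2.0
estimate = simulated_mean(lam, t, runs=100_000)
exact = (2 - math.exp(-lam * t)) / lam   # Proposition 7.3.3
print(estimate, exact, 1 / lam)          # 1 / lam is the mean of an ordinary waiting time
```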



7.4 The Strong Law of Large Numbers 

In this section we give an interesting application of the strong law of large numbers. 
At time $t$ we have seen $N(t)$ occurrences. The average waiting time between two 
occurrences is equal to $\lambda^{-1}$, the expectation of an individual $X_i$. Hence in a time 
interval of length $t$, we should expect around $\lambda t$ occurrences, that is, we expect 
$N(t)$ to be close to $\lambda t$. 

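Theorem 7.4.1. (Strong law of large numbers) We have 
$$P\left( \lim_{t \to \infty} \frac{N(t)}{t} = \lambda \right) = 1.$$

Proof. Observe that 
$$S_{N(t)} \le t < S_{N(t)+1}, \qquad (7.5)$$
and hence division by $N(t)$ gives 
$$\frac{S_{N(t)}}{N(t)} \le \frac{t}{N(t)} < \frac{S_{N(t)+1}}{N(t)+1} \cdot \frac{N(t)+1}{N(t)}. \qquad (7.6)$$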

According to Theorem 6.3.2, the left- and right-hand sides of (7.6) 
converge to $\lambda^{-1}$, with probability 1. It follows immediately that $t/N(t)$ also 
converges to $\lambda^{-1}$, with probability 1. □ 

This law of large numbers can be used for statistical applications. If you have 
some process that you want to describe with a Poisson process, then you may want 
to get an idea about a suitable intensity $\lambda$. The last result tells us that $N(t)/t$ 
should with high probability be close to $\lambda$. When we observe $N(t)$, we can estimate 
$\lambda$ using this fact. 
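
As an illustration of this estimation idea, the following sketch generates a Poisson process with a known intensity and prints the estimate $N(t)/t$ for increasing observation times. The function name `count_occurrences` and the value `true_lam = 4.0` are assumptions made for the example only; in practice the true intensity is of course unknown.

```python
import random

def count_occurrences(lam, t, rng):
    """N(t): the number of occurrences in (0, t] when waiting times are Exp(lam)."""
    n, s = 0, rng.expovariate(lam)
    while s <= t:
        n += 1
        s += rng.expovariate(lam)
    return n

rng = random.Random(1)
true_lam = 4.0   # hypothetical intensity, used only to generate the data
estimates = {t: count_occurrences(true_lam, t, rng) / t
             for t in (10.0, 1_000.0, 100_000.0)}
print(estimates)  # the estimates should approach true_lam as t grows
```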

Exercise 7.4.2. Do you think that it is easier or more difficult to estimate $\lambda$ 
when $\lambda$ gets larger? Motivate your answer. 



7.5 Exercises 

Exercise 7.5.1. An employee in a call center works from 8 a.m. until 5 p.m., with 
breaks between 10.30-10.45, 12.30-13.30 and 14.45-15.00. Assume that calls come 
in according to a Poisson process with expected number of calls per hour equal to 
6. 

(a) What is the probability that there are at most 10 calls during the breaks? 

(b) What is the probability that the first call of the day is after 8:10 a.m.? 

(c) What is the probability that the employee can do something else for 45 minutes 
without being disturbed by a call? 

Exercise 7.5.2. Let $0 < s < t$. Compute the conditional distribution of $N(s)$ given 
that $N(t) = n$. Do you recognize this distribution? 

Exercise 7.5.3. Compute the distribution of 

Exercise 7.5.4. Consider a Poisson process with parameter $\lambda$. What is the 
conditional probability that $N(1) = n$ given that $N(3) = n$? Do you understand why 
this probability does not depend on $\lambda$? 

Exercise 7.5.5. Give an alternative proof of Lemma 7.2.2, using Theorem 7.2.8. 

Exercise 7.5.6. One can compute the exact distribution of $X_{N(t)+1}$, the time 
interval between the last occurrence before time $t$ and the first occurrence after time 
$t$. Show that the density $f$ of $X_{N(t)+1}$ is given by 
$$f(x) = \begin{cases} \lambda^2 x e^{-\lambda x} & \text{for } x < t, \\ \lambda(1 + \lambda t) e^{-\lambda x} & \text{for } x > t, \end{cases}$$
and compute the expectation of $X_{N(t)+1}$ using this density. 

Exercise 7.5.7. (Thinning a Poisson process) Let $N(t)$ be a Poisson process with 
intensity $\lambda$. For each occurrence, we flip a coin: if heads comes up we label the 
occurrence green, if tails comes up we label it red. The coin flips are independent 
and $p$ is the probability to see heads. 

(a) Show that the green occurrences form a Poisson process with intensity $\lambda p$. 

(b) Connect this with Example 2.2.3. 

(c) We claim that the red occurrences on the one hand, and the green occurrences 
on the other hand form independent Poisson processes. Can you formulate this 
formally, and prove it, using Example 2.2.3 once more? 

Exercise 7.5.8. (The inhomogeneous Poisson process) In the theory of this chapter, 
the intensity $\lambda$ was a constant. However, there might be situations in which it is 
more reasonable to allow for a varying intensity of the process. Instead of a fixed 
intensity $\lambda$, we want to have a model in which $\lambda = \lambda(t)$, a function of the time $t$. 

(a) Can you think of a number of examples where this might be a reasonable thing 
to do? 

(b) Define such an inhomogeneous Poisson process, and show that (under certain 
conditions) the distribution of the number of occurrences in the time interval $(a, b)$ 
is given by a Poisson distribution with parameter $\int_a^b \lambda(t)\,dt$. 

Exercise 7.5.9. (The Poisson process in higher dimensions) We can extend the 
idea of random occurrences of events to higher dimensions. Of course, we should 
not talk about occurrences in time anymore then, but instead talk about objects 
that we distribute over some two-dimensional set in some completely random way. 
Think of how you would define such a two-dimensional process. In particular, note 
that not all approaches to the Poisson process in this chapter can be generalised to 
higher dimensions. For instance, the waiting time does not have an interpretation 
in two dimensions. But the number of points in a given region can possibly again 
be described by a Poisson distribution. Conclude that in generalising the process 
to higher dimensions, not all approaches are equally suitable. 




Chapter 8 

Limit Theorems 



In this chapter, we will be concerned with some more general limit theorems. 
In particular, we shall generalise the central limit Theorem 4.2.1. The method of 
proof will also lead to a new formulation of the law of large numbers. The methods 
behind these results are not so easy. They rely on concepts from complex analysis. 
To make sure that you know what we are talking about, in Section 8.2 there will 
be a short introduction to complex analysis which contains all the background 
necessary for the development in this chapter. In this chapter, random variables 
can be discrete or continuous. We start by formalising a mode of convergence that 
we have, in fact, already seen. 



8.1 Weak Convergence 

When we have random variables $X$ and $X_n$, there are several ways of expressing 
the idea that $X_n$ should be close to $X$. For instance, when the random variables 
are defined on the same sample space, we can look at $P(|X_n - X| > \epsilon)$. This was 
the kind of ‘closeness’ that was used in the weak law of large numbers Theorem 
4.1.4, for instance. We want to emphasize that this type of closeness only makes 
sense when all random variables are defined on the same sample space. 

In this chapter, we will discuss another type of closeness, which does not 
require the random variables to be defined on the same sample space. The type 
of convergence that we have in mind is called weak convergence and is defined via 
the distribution function of the random variables. We think it is best to first give 
the formal definition and comment on the definition afterwards. 

Definition 8.1.1. Let $X, X_1, X_2, \ldots$ be random variables with distribution functions 
$F, F_1, F_2, \ldots$ respectively. We say that the sequence $X_1, X_2, \ldots$ converges weakly 
to $X$ if 
$$F_n(x) \to F(x),$$
for all $x$ in which $F$ is continuous, and we write this as 
$$X_n \Rightarrow X \quad \text{or} \quad F_n \Rightarrow F.$$

Weak convergence is also called convergence in distribution in the literature. 
At first sight, this definition may appear somewhat unnatural, especially the fact 
that we do not require convergence in points of discontinuity of $F$. At this point, 
we cannot fully explain this, but the following example does indicate the reason 
for the fact that convergence is not required at points of discontinuity of $F$. 



Example 8.1.2. Let $F_n$ be defined by 
$$F_n(x) = \begin{cases} 0 & \text{if } x < \frac{1}{n}, \\ 1 & \text{if } x \ge \frac{1}{n}. \end{cases}$$
$F_n$ is the distribution function of a random variable $X_n$ which satisfies 
$P(X_n = 1/n) = 1$. Clearly, when $n \to \infty$, the distribution of $X_n$ gets ‘closer’ to the 
distribution of a random variable $X$ with $P(X = 0) = 1$. However, if we denote the 
distribution function of $X$ by $F$, then we see that $F_n(0)$ does not converge to $F(0)$, as 
$n \to \infty$. Indeed, $F_n(0) = 0$ for all $n$, but $F(0) = 1$. □ 
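
The behaviour in Example 8.1.2 can be tabulated directly. In the short sketch below (the function names are ours), the values $F_n(x)$ are listed for a few $n$ at $x = 0$, where $F$ jumps, and at two continuity points of $F$.

```python
def F_n(x, n):
    """Distribution function of X_n with P(X_n = 1/n) = 1."""
    return 1.0 if x >= 1.0 / n else 0.0

def F(x):
    """Distribution function of X with P(X = 0) = 1."""
    return 1.0 if x >= 0.0 else 0.0

for x in (-0.5, 0.0, 0.01):
    # F_n(x) for growing n, followed by the candidate limit F(x)
    print(x, [F_n(x, n) for n in (1, 10, 100, 1000)], F(x))
```

At $x = -0.5$ and $x = 0.01$ the values $F_n(x)$ eventually agree with $F(x)$; only at the discontinuity point $x = 0$ do they stay away from $F(0)$.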

Exercise 8.1.3. Let $F_n'$ be defined by 
$$F_n'(x) = \begin{cases} 0 & \text{if } x < -\frac{1}{n}, \\ 1 & \text{if } x \ge -\frac{1}{n}. \end{cases}$$
Show that in this case, $F_n'(x) \to F(x)$ for all $x$. 

The distribution functions $F_n'$ correspond to random variables $Y_n$ which 
satisfy $P(Y_n = -1/n) = 1$. Hence we see that if we had required convergence for all 
$x$, the random variables $X_n$ would not converge weakly to $X$, but the $Y_n$ would. 
This would be a somewhat strange state of affairs, which is avoided by the way we 
defined weak convergence. We do realise though, that this does not fully account 
for the definition, and at this point we just would like to ask for some patience. 

Example 8.1.4. The Central Limit Theorem 4.2.1 is an example of weak conver- 
gence. □ 

The following example shows that it is possible that discrete random variables 
converge weakly to a continuous random variable. 

Example 8.1.5. Let $X_n$ be a random variable taking the values $\{1/n, 2/n, \ldots, 1\}$, 
each with probability $1/n$, and let $X$ be a uniformly distributed random variable on 
$(0, 1)$. Denote the corresponding distribution functions by $F_n$ and $F$, respectively. 
We claim that $X_n \Rightarrow X$. To see this, take $0 \le y \le 1$, and observe that 
$$P(X_n \le y) = \frac{\lfloor ny \rfloor}{n},$$
where $\lfloor x \rfloor$ denotes the largest integer which is smaller than or equal to $x$. Clearly, 
the right-hand side converges to $y$, as $n \to \infty$. For $y < 0$ and $y > 1$, we have 
$F_n(y) = 0$ and $F_n(y) = 1$, respectively. It is now clear that $F_n \Rightarrow F$. □ 
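
The convergence in Example 8.1.5 is easy to inspect numerically. The following sketch, with a function name and test point $y = 1/3$ of our own choosing, evaluates $\lfloor ny \rfloor / n$ and prints its distance to $y$, which is always smaller than $1/n$.

```python
import math

def F_n(y, n):
    """P(X_n <= y) when X_n is uniform on {1/n, 2/n, ..., 1}."""
    if y < 0:
        return 0.0
    if y > 1:
        return 1.0
    return math.floor(n * y) / n

y = 1 / 3
gaps = [abs(F_n(y, n) - y) for n in (10, 100, 1000, 10_000)]
print(gaps)  # each gap is smaller than the corresponding 1/n
```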

In the following exercise it is shown that it is possible for continuous random 
variables $X, X_1, X_2, \ldots$ that $X_n \Rightarrow X$ in such a way that the densities do not 
converge. 

Exercise 8.1.6. Let $X_n$ have distribution function 
$$F_n(x) = x - \frac{\sin(2n\pi x)}{2n\pi},$$
for $0 \le x \le 1$. First show that $F_n$ is indeed a distribution function and after that, 
show that $X_n$ converges weakly to the uniform distribution on $(0, 1)$, but that the 
density of $X_n$ does not converge to the density of the uniform distribution. 



8.2 Characteristic Functions 

Characteristic functions are one of the main tools in studying weak convergence. 
In this section, we introduce these characteristic functions and derive some of their 
basic properties. Unfortunately, characteristic functions require a certain amount 
of complex analysis, not much, but perhaps just enough to scare you off. Don’t 
worry too much about this, we will define all the necessary machinery in the course 
of the developments, and some results from complex analysis we will just take for 
granted. 

First of all, we shall need the exponential function 
$$e^z = \sum_{k=0}^{\infty} \frac{z^k}{k!},$$
for all $z \in \mathbb{C}$. This function has the nice property of being its own derivative. The 
standard limit 
$$\lim_{n \to \infty} \left(1 + \frac{z}{n}\right)^n = e^z, \qquad (8.1)$$
which we recognise from real analysis, remains valid in the complex setting. We 
also note the important identity 
$$e^{it} = \cos t + i \sin t.$$



For a continuous function $f : (a, b) \to \mathbb{C}$, we define 
$$\int_a^b f(t)\,dt = \int_a^b \mathrm{Re} f(t)\,dt + i \int_a^b \mathrm{Im} f(t)\,dt,$$
where $\mathrm{Re} f$ and $\mathrm{Im} f$ are the real and imaginary part of $f$, respectively. In particular 
we find that 
$$\int_a^b e^{it}\,dt = \int_a^b \cos t\,dt + i \int_a^b \sin t\,dt.$$

In view of this, it makes sense to define, for a random variable $X$, 
$$E(e^{itX}) := E(\cos tX) + iE(\sin tX).$$
This expectation is called the characteristic function of $X$. Characteristic 
functions are strongly related to the generating functions in Section 2.6. 

Definition 8.2.1. Let $X$ be a random variable. The characteristic function of $X$, 
denoted by $\phi_X(t) : \mathbb{R} \to \mathbb{C}$, is defined as 
$$\phi_X(t) = E(\cos tX) + iE(\sin tX),$$
which we often denote by $E(e^{itX})$. 

So, if $X$ is a continuous random variable with density $f_X$, then 
$$\phi_X(t) = \int_{-\infty}^{\infty} e^{itx} f_X(x)\,dx,$$
and if $X$ is discrete, we obtain 
$$\phi_X(t) = \sum_x e^{itx} P(X = x).$$

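Working with characteristic functions requires some knowledge about complex 
integrals. Some basic properties of these integrals are given in the next theorem, 
where $f$ and $g$ are complex-valued functions on $\mathbb{R}$. 

Theorem 8.2.2. (a) $\int_a^b (f + g)(t)\,dt = \int_a^b f(t)\,dt + \int_a^b g(t)\,dt$. 

(b) $\int_a^b (\lambda f)(t)\,dt = \lambda \int_a^b f(t)\,dt$, for all $\lambda \in \mathbb{C}$. 

(c) $\left| \int_a^b f(t)\,dt \right| \le \int_a^b |f(t)|\,dt$. 

Proof. (a) follows from the corresponding properties of real integrals after splitting 
the integral in a real and imaginary part. The details are left as an exercise. For 
(b), it is convenient to write $\lambda = \lambda_1 + i\lambda_2$, with $\lambda_1$ and $\lambda_2$ in $\mathbb{R}$, and also write 
$f = f_1 + if_2$, with $f_1$ and $f_2$ the real and imaginary part of $f$ respectively. We 
then write 
$$\begin{aligned}
\int_a^b (\lambda f)(t)\,dt &= \int_a^b (\lambda_1 f_1 - \lambda_2 f_2)(t)\,dt + i \int_a^b (\lambda_1 f_2 + \lambda_2 f_1)(t)\,dt \\
&= (\lambda_1 + i\lambda_2) \left( \int_a^b f_1(t)\,dt + i \int_a^b f_2(t)\,dt \right) = \lambda \int_a^b f(t)\,dt.
\end{aligned}$$
The proof of (c) is more subtle. We first write $\int_a^b f(t)\,dt = re^{i\theta}$, where $r$ is the 
modulus, and $\theta$ the argument, as usual. Then we have 
$$r = e^{-i\theta} \int_a^b f(t)\,dt = \int_a^b e^{-i\theta} f(t)\,dt.$$
Since $r \in \mathbb{R}$ it now follows that the last integral is real, and is therefore equal to 
its real part. Hence we obtain that 
$$\left| \int_a^b f(t)\,dt \right| = r = \int_a^b \mathrm{Re}\left(e^{-i\theta} f(t)\right) dt \le \int_a^b \left|e^{-i\theta} f(t)\right| dt = \int_a^b |f(t)|\,dt.$$
□ 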
Before giving some examples of characteristic functions, we would like to 
mention (and prove) two important properties. 

Theorem 8.2.3. (a) If $X$ and $Y$ are independent, then $\phi_{X+Y}(t) = \phi_X(t)\phi_Y(t)$. 

(b) If $a, b \in \mathbb{R}$, and $Y = aX + b$, then $\phi_Y(t) = e^{itb}\phi_X(at)$. 

Proof. For (a), we can use Theorem 5.6.12. Hence, we should again split everything 
in real and imaginary parts as follows. 
$$\begin{aligned}
\phi_{X+Y}(t) &= E(e^{it(X+Y)}) = E(e^{itX} e^{itY}) \\
&= E((\cos tX + i \sin tX)(\cos tY + i \sin tY)) \\
&= E(\cos tX \cos tY - \sin tX \sin tY) + iE(\sin tX \cos tY + \cos tX \sin tY) \\
&= E(\cos tX)E(\cos tY) - E(\sin tX)E(\sin tY) \\
&\quad + i\left(E(\sin tX)E(\cos tY) + E(\cos tX)E(\sin tY)\right) \\
&= (E(\cos tX) + iE(\sin tX))(E(\cos tY) + iE(\sin tY)) \\
&= \phi_X(t)\phi_Y(t).
\end{aligned}$$
For (b), we write 
$$\phi_Y(t) = E(e^{it(aX+b)}) = E(e^{itb} e^{i(ta)X}) = e^{itb} E(e^{i(ta)X}) = e^{itb} \phi_X(at). \quad □$$

Exercise 8.2.4. Extend the first part of this theorem to more than two random 
variables. 



Example 8.2.5. Let $X$ be a discrete random variable with $P(X = 1) = p$ and 
$P(X = 0) = 1 - p$. Then $\phi_X(t) = (1 - p) + pe^{it}$. □ 

Example 8.2.6. If $X$ has a binomial distribution with parameters $n$ and $p$, then $X$ 
can be written as the sum of $n$ independent random variables as in the previous 
example. Hence, according to Theorem 8.2.3, we find that 
$$\phi_X(t) = (1 - p + pe^{it})^n. \quad □$$
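
The two examples above can be verified numerically. In the sketch below (the function `cf_discrete` and the values of `n`, `p` and `t` are our own choices), the characteristic function of a binomial random variable is computed directly from the sum $\sum_x e^{itx}P(X = x)$ and compared with the product form $(1 - p + pe^{it})^n$.

```python
import cmath
import math

def cf_discrete(pmf, t):
    """phi_X(t) = sum_x e^{itx} P(X = x), for a pmf given as {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

n, p, t = 12, 0.3, 0.7
binomial_pmf = {k: math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

direct = cf_discrete(binomial_pmf, t)               # definition for discrete X
product_form = (1 - p + p * cmath.exp(1j * t)) ** n  # Example 8.2.6
print(direct, product_form)
```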



Exercise 8.2.7. Let $X$ have an exponential distribution with parameter 1. Then 
its characteristic function is $\phi_X(t) = \int_0^{\infty} e^{itx} e^{-x}\,dx$. In this case, we can 
compute this by repeated partial integration. For this, we first consider the real part 
$\int_0^{\infty} \cos(tx)\, e^{-x}\,dx$. Compute this integral by doing partial integration twice. Do the 
same for the imaginary part, put things together and show that 
$$\phi_X(t) = \frac{1}{1 - it}.$$

Exercise 8.2.8. Show that when $X$ has a Poisson distribution with parameter $\lambda$, 
we have 
$$\phi_X(t) = e^{\lambda(e^{it} - 1)}.$$






These examples suggest that we can typically compute characteristic 
functions without too much trouble. Unfortunately, this is not the case. For instance, 
at this point we cannot compute the characteristic function of the normal 
distribution or the Cauchy distribution. We will come back to this soon. 

We think it is about time to explain why we should be interested in char- 
acteristic functions at all. Two major results form the basis for this interest, the 
inversion theorem and the continuity theorem. The inversion theorem explains the 
name characteristic function. Indeed, it tells us that the distribution of a random 
variable is completely determined by its characteristic function. 

Theorem 8.2.9. (Inversion theorem) (a) Two random variables $X$ and $Y$ have 
the same characteristic function if and only if they have the same distribution 
function. 

(b) If $X$ is continuous with density $f$ and characteristic function $\phi$, then 
$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx} \phi(t)\,dt,$$
at every point $x$ at which $f$ is differentiable. 

(c) If $X$ takes values in $\mathbb{N}$ and has characteristic function $\phi$, then 
$$P(X = k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-itk} \phi(t)\,dt.$$

Proof. We will not give a proof of (a) here. We have two reasons for that. Firstly, 
the general proof is somewhat technical and perhaps not so suitable for a book 
at this level. Secondly (and more importantly), in this book we mainly use (b) 
and (c), which are refinements in two special cases of (a). The statement in (b) is 
nothing but the classical Fourier inversion theorem, and the proof can be found in 
any introduction to Fourier transforms. This is also true for the proof of (c), but 
since it is so simple, we prefer to give it here. The proof of (c) runs as follows. It 
is an easy exercise to show that 
$$\int_{-\pi}^{\pi} e^{itk}\,dt = 0,$$
when $k$ is a nonzero integer, and of course equal to $2\pi$ when $k = 0$. Hence we can write 
$$\begin{aligned}
P(X = k) &= \frac{1}{2\pi} \sum_{j=0}^{\infty} P(X = j) \int_{-\pi}^{\pi} e^{it(j-k)}\,dt \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-itk} \sum_{j=0}^{\infty} P(X = j) e^{itj}\,dt \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-itk} \phi(t)\,dt,
\end{aligned}$$
proving the result. 

□ 
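
Part (c) of the inversion theorem can be tried out numerically. The sketch below is only an illustration under our own assumptions: the function names are hypothetical, the integral is approximated by a midpoint rule with an arbitrary number of steps, and we take the Poisson characteristic function $e^{\lambda(e^{it} - 1)}$ of Exercise 8.2.8 as given. The recovered values are compared with the Poisson probabilities $e^{-\lambda}\lambda^k/k!$.

```python
import cmath
import math

def poisson_cf(t, lam):
    """Characteristic function of a Poisson(lam) variable (formula of Exercise 8.2.8)."""
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

def invert(cf, k, steps=2000):
    """Approximate (1/2pi) * integral_{-pi}^{pi} e^{-itk} cf(t) dt by the midpoint rule."""
    h = 2 * math.pi / steps
    total = 0j
    for j in range(steps):
        t = -math.pi + (j + 0.5) * h
        total += cmath.exp(-1j * t * k) * cf(t)
    return (total * h / (2 * math.pi)).real

lam = 2.5
for k in range(6):
    exact = math.exp(-lam) * lam**k / math.factorial(k)
    print(k, invert(lambda t: poisson_cf(t, lam), k), exact)
```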



Exercise 8.2.10. The careful reader will have noticed that in the last proof, we 
interchanged sum and integral. Do you see why this is allowed? 

Finally, the following continuity theorem articulates the relation between 
weak convergence and characteristic functions. 

Theorem 8.2.11. (Continuity theorem) Suppose that $X, X_1, X_2, \ldots$ are random 
variables with characteristic functions $\phi, \phi_1, \phi_2, \ldots$. Then 
$$X_n \Rightarrow X$$
if and only if 
$$\phi_n(t) \to \phi(t)$$
for all $t$, when $n \to \infty$. 

Perhaps it now becomes clear how we are going to prove limit theorems. What 
we need to do is to compute characteristic functions, and identify limits. This is 
easier said than done. In the next section, we will demonstrate an expansion of the 
characteristic functions which turns out to be useful for both purposes: computing 
characteristic functions, and computing limits. 



8.3 Expansion of the Characteristic Function 

This section is rather technical in nature, but we want to emphasise that the 
computations are not very difficult. If you get tired of the computational details, 
then it will be enough to study the results without looking at the proofs. In fact, 
the proofs do not give much information at all, but we prefer to include them. 
Everything is based on the following estimate. 

Lemma 8.3.1. For all $x \in \mathbb{R}$ and $n \ge 0$ we have 
$$\left| e^{ix} - \sum_{k=0}^{n} \frac{(ix)^k}{k!} \right| \le \min\left\{ \frac{|x|^{n+1}}{(n+1)!}, \frac{2|x|^n}{n!} \right\}.$$

Proof. The proof is based on partial integration in the complex setting. We have 
not told you before that you are allowed to do that, but in fact the usual partial 
integration formula (under certain continuity assumptions) holds. 

The first step of the proof is to show by partial integration that 
$$\int_0^x (x - s)^n e^{is}\,ds = \frac{x^{n+1}}{n+1} + \frac{i}{n+1} \int_0^x (x - s)^{n+1} e^{is}\,ds. \qquad (8.2)$$
From this formula it is not hard to show by induction that 
$$e^{ix} = \sum_{k=0}^{n} \frac{(ix)^k}{k!} + \frac{i^{n+1}}{n!} \int_0^x (x - s)^n e^{is}\,ds. \qquad (8.3)$$
Hence, 
$$\left| e^{ix} - \sum_{k=0}^{n} \frac{(ix)^k}{k!} \right| = \left| \frac{i^{n+1}}{n!} \int_0^x (x - s)^n e^{is}\,ds \right| \le \frac{1}{n!} \left| \int_0^x |x - s|^n\,ds \right| = \frac{|x|^{n+1}}{(n+1)!},$$
accounting for the first term at the right-hand side of the lemma. 
Substituting $n - 1$ for $n$ in (8.2) gives 
$$\int_0^x (x - s)^n e^{is}\,ds = \frac{n}{i} \int_0^x (x - s)^{n-1} (e^{is} - 1)\,ds,$$
and substituting this into (8.3) tells us that 
$$e^{ix} = \sum_{k=0}^{n} \frac{(ix)^k}{k!} + \frac{i^n}{(n-1)!} \int_0^x (x - s)^{n-1} (e^{is} - 1)\,ds.$$
We can now estimate the integral at the right-hand side (using that $|e^{is} - 1| \le 2$) 
in the same way as above, giving the second term at the right-hand side in the 
lemma. □ 
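
The estimate of Lemma 8.3.1 can be inspected numerically for a few values of $x$ and $n$. The following sketch, with function names and sample points of our own choosing, prints the left-hand side next to the bound on the right-hand side.

```python
import cmath
import math

def taylor_remainder(x, n):
    """|e^{ix} - sum_{k=0}^{n} (ix)^k / k!|, the left-hand side of Lemma 8.3.1."""
    partial = sum((1j * x) ** k / math.factorial(k) for k in range(n + 1))
    return abs(cmath.exp(1j * x) - partial)

def lemma_bound(x, n):
    """min(|x|^{n+1}/(n+1)!, 2|x|^n/n!), the right-hand side of Lemma 8.3.1."""
    return min(abs(x) ** (n + 1) / math.factorial(n + 1),
               2 * abs(x) ** n / math.factorial(n))

for x in (-7.0, -0.3, 0.5, 2.0, 10.0):
    for n in range(6):
        print(x, n, taylor_remainder(x, n), lemma_bound(x, n))
```

For small $|x|$ the first term of the minimum is the smaller one, and for large $|x|$ the second term takes over, which is why both are kept in the lemma.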

Having done this unattractive and somewhat tedious computation, we can 
now enjoy the corollaries: 

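Theorem 8.3.2. Let $X$ be a random variable for which $E(|X|^n) < \infty$. Then, 
$$\phi_X(t) = \sum_{k=0}^{n} \frac{(it)^k}{k!} E(X^k) + \beta(t),$$
where $\beta(t)/t^n \to 0$ as $t \to 0$. 

Proof. We give the proof for continuous random variables. By Lemma 8.3.1, 
$$\begin{aligned}
\left| \phi_X(t) - \sum_{k=0}^{n} \frac{(it)^k}{k!} E(X^k) \right| 
&= \left| \int_{-\infty}^{\infty} \left( e^{itx} - \sum_{k=0}^{n} \frac{(itx)^k}{k!} \right) f_X(x)\,dx \right| \\
&\le \int_{-\infty}^{\infty} \left| e^{itx} - \sum_{k=0}^{n} \frac{(itx)^k}{k!} \right| f_X(x)\,dx \\
&\le \int_{-\infty}^{\infty} \min\left\{ \frac{|tx|^{n+1}}{(n+1)!}, \frac{2|tx|^n}{n!} \right\} f_X(x)\,dx \\
&= |t|^n \int_{-\infty}^{\infty} \min\left\{ \frac{|t|\,|x|^{n+1}}{(n+1)!}, \frac{2|x|^n}{n!} \right\} f_X(x)\,dx.
\end{aligned}$$
The integrand in the last formula is clearly bounded above by $2|x|^n f_X(x)$ and by 
assumption, the integral of this last function is finite. Since the first term inside 
the min goes to 0 as $t \to 0$, the integrand tends to zero, and by the dominated 
convergence Theorem 3.2.4, the whole integral tends to 0. This proves the result. 

□ 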

The next result can be used to compute the characteristic function of a 
normal random variable. 

Theorem 8.3.3. Let $X$ have characteristic function $\phi$, and let $t$ be such that 
$$\lim_{n \to \infty} \frac{|t|^n E(|X|^n)}{n!} = 0. \qquad (8.4)$$
Then $\phi(t)$ has the expansion 
$$\phi(t) = \sum_{k=0}^{\infty} \frac{(it)^k}{k!} E(X^k).$$

Exercise 8.3.4. Prove this result, using the proof of the previous theorem. 

Here follows a nice application of the last theorem. 

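Example 8.3.5. In this example, we compute the characteristic function of the 
normal distribution. In Section 8.5 we shall see how important this computation 
is. 

Let $X$ have a standard normal distribution. First, we want to compute 
$E(X^k)$. It is easy to see that this is zero for odd $k$, and we therefore concentrate 
on even $k$. Integration by parts shows that 
$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^k e^{-x^2/2}\,dx = \frac{k-1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^{k-2} e^{-x^2/2}\,dx.$$
(To do this partial integration, observe that the integrand on the left can be written 
as $x^{k-1} \cdot x e^{-x^2/2}$.) Now $E(X^0) = 1$, and the formula tells us that 
$E(X^{2k}) = (2k - 1)E(X^{2k-2})$. Hence we find that 
$$E(X^{2k}) = 1 \cdot 3 \cdot 5 \cdots (2k - 1).$$
From this it follows that (8.4) is satisfied for all $t$, and Theorem 8.3.3 applies. It 
follows that 
$$\phi_X(t) = \sum_{k=0}^{\infty} \frac{(it)^{2k}}{1 \cdot 2 \cdot 3 \cdots 2k}\, 1 \cdot 3 \cdot 5 \cdots (2k - 1) = \sum_{k=0}^{\infty} \frac{(-t^2)^k}{2 \cdot 4 \cdots 2k} = \sum_{k=0}^{\infty} \frac{1}{k!} \left( -\frac{t^2}{2} \right)^k = e^{-t^2/2}.$$
□ 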
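
The computation in Example 8.3.5 can be followed numerically. In the sketch below (the function names and the number of terms are our own choices), the even moments $1 \cdot 3 \cdot 5 \cdots (2k-1)$ are generated and a partial sum of the expansion of Theorem 8.3.3 is compared with $e^{-t^2/2}$.

```python
import math

def even_moment(k):
    """E(X^{2k}) = 1*3*5*...*(2k-1) for a standard normal X (equal to 1 when k = 0)."""
    result = 1
    for odd in range(1, 2 * k, 2):
        result *= odd
    return result

def cf_partial_sum(t, terms):
    """Partial sum of sum_k (it)^{2k} E(X^{2k}) / (2k)!; real because i^{2k} = (-1)^k."""
    return sum((-1) ** k * t ** (2 * k) * even_moment(k) / math.factorial(2 * k)
               for k in range(terms))

for t in (0.5, 1.0, 2.0):
    print(t, cf_partial_sum(t, 40), math.exp(-t * t / 2))
```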






Exercise 8.3.6. Compute, using Theorem 8.2.3 and Example 5.3.6, the 
characteristic function of a random variable with normal distribution with parameters $\mu$ 
and $\sigma^2$. 



8.4 The Law of Large Numbers 

Now we have set up things in such a way that we can state and quickly prove our 
next law of large numbers. This law will be in terms of weak convergence. The 
main reason to include this law is to illustrate a certain technique. After stating 
and proving the result, we will compare this law of large numbers to Theorem 
4.1.4. 

Theorem 8.4.1. (Law of large numbers) Let $X_1, X_2, \ldots$ be independent random 
variables with the same distribution, and suppose that they have finite expectation 
$\mu$. Let $S_n = X_1 + \cdots + X_n$. Then 
$$\frac{S_n}{n} \Rightarrow \mu,$$
that is, $P(S_n/n \le x) \to 0$ if $x < \mu$ and $P(S_n/n \le x) \to 1$ if $x > \mu$. 

Before proving this result, let me state a small technical lemma that turns 
out to be useful in the proof. 

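Lemma 8.4.2. Let $z_1, \ldots, z_m$ and $z_1', \ldots, z_m'$ be complex numbers of modulus at 
most $r$. Then 
$$|z_1 z_2 \cdots z_m - z_1' z_2' \cdots z_m'| \le r^{m-1} \sum_{k=1}^{m} |z_k - z_k'|.$$

Proof. This can be proved by induction. For $m = 1$ the result is clearly true. 
Suppose that the result is true for $m - 1$. Note the following identity: 
$$z_1 z_2 \cdots z_m - z_1' z_2' \cdots z_m' = (z_1 - z_1')(z_2 z_3 \cdots z_m) + z_1'(z_2 z_3 \cdots z_m - z_2' z_3' \cdots z_m').$$
Hence, we have that 
$$\begin{aligned}
|z_1 \cdots z_m - z_1' \cdots z_m'| &\le |(z_1 - z_1')(z_2 \cdots z_m)| + |z_1'(z_2 \cdots z_m - z_2' \cdots z_m')| \\
&\le r^{m-1} |z_1 - z_1'| + r\,|z_2 \cdots z_m - z_2' \cdots z_m'| \\
&\le r^{m-1} \sum_{k=1}^{m} |z_k - z_k'|,
\end{aligned}$$
according to the induction hypothesis. □ 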



Proof of Theorem 8.4.1. The method of proof is perhaps clear by now. We need
to show, by the Continuity Theorem 8.2.11, that the characteristic function of
$S_n/n$ converges to the characteristic function of the constant $\mu$. The characteristic
function of $S_n/n$ can be computed with the help of Theorem 8.3.2 as follows.

Let $\phi(t)$ be the characteristic function common to the $X_i$'s. The characteristic
function of $X_1 + \cdots + X_n$ is then equal to $\phi(t)^n$, according to Theorem 8.2.3.
According to the same theorem, the characteristic function of $S_n/n$ is then equal
to $\phi(t/n)^n$. According to Theorem 8.3.2, as $n \to \infty$ (which implies that $t/n \to 0$),

$$\phi\left(\frac{t}{n}\right) = 1 + \frac{i\mu t}{n} + \beta\left(\frac{t}{n}\right),$$

where $\beta(t/n)/(t/n) \to 0$ as $n \to \infty$. It then follows that

$$n \left| \phi\left(\frac{t}{n}\right) - 1 - \frac{i\mu t}{n} \right| \to 0 \qquad (8.5)$$

as $n \to \infty$, by the fact that $\beta(t/n)/(t/n) \to 0$. Now we apply Lemma 8.4.2 with
$m = n$, $z_k = 1 + (i\mu t)/n$, $z'_k = \phi(t/n)$ and $r = 1 + |\mu t|/n$. It then follows that

$$\left| \left(1 + \frac{i\mu t}{n}\right)^n - \phi\left(\frac{t}{n}\right)^n \right| \le \left(1 + \frac{|\mu t|}{n}\right)^{n-1} n \left| \phi\left(\frac{t}{n}\right) - 1 - \frac{i\mu t}{n} \right|,$$

which goes to 0, using (8.1) and (8.5). Since $(1 + (i\mu t)/n)^n \to e^{i\mu t}$ (according to
(8.1)), it follows that also $\phi(t/n)^n \to e^{i\mu t}$, which is the characteristic function of
the constant $\mu$. □
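The convergence used in this proof can be made visible for a concrete distribution. The sketch below is an illustration only: it assumes exponentially distributed summands with rate `lam`, whose characteristic function is $\lambda/(\lambda - it)$ and whose expectation is $1/\lambda$, and the function names and the value $t = 2$ are arbitrary. It evaluates the characteristic function of $S_n/n$ for growing $n$ and compares it with the characteristic function of the constant $\mu$.

```python
import cmath

def phi_exponential(t, lam=1.0):
    """Characteristic function of the exponential distribution with rate lam."""
    return lam / (lam - 1j * t)

def cf_of_mean(t, n, lam=1.0):
    """Characteristic function of S_n/n, that is, phi(t/n) to the power n."""
    return phi_exponential(t / n, lam) ** n

t, lam = 2.0, 1.0                      # arbitrary test point and rate
mu = 1.0 / lam                         # expectation of the exponential distribution
target = cmath.exp(1j * mu * t)        # characteristic function of the constant mu

# Distance between the two characteristic functions for growing n.
errors = [abs(cf_of_mean(t, n, lam) - target) for n in (10, 100, 1000, 10000)]
print(errors)
```

The printed distances shrink roughly like $1/n$, in agreement with the statement that the characteristic function of $S_n/n$ converges to that of the constant $\mu$.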
Exercise 8.4.3. It is not so easy to compute the characteristic function of the
Cauchy distribution. One needs so-called contour integrals for this. But let me tell
you that this characteristic function is

$$\phi(t) = e^{-|t|}.$$

Let $X_1, \ldots, X_n$ be independent, all having a Cauchy distribution. Show that
$(X_1 + \cdots + X_n)/n$ again has a Cauchy distribution. Why does this not contradict
Theorem 8.4.1?

It is now interesting to compare Theorem 4.1.4 to Theorem 8.4.1. There 
is a sense in which the last theorem is in fact a consequence of the first. This 
follows from the following general result, which compares two different modes of 
convergence. 

Theorem 8.4.4. Let $X, X_1, X_2, \ldots$ be defined on the same sample space. If

$$P(|X_n - X| > \epsilon) \to 0$$

for all $\epsilon > 0$, then

$$X_n \Rightarrow X.$$

Proof. Denote the distribution functions of $X$ and $X_n$ by $F$ and $F_n$ respectively.
Then we write, for any $\epsilon > 0$,

$$\begin{aligned}
F_n(x) &= P(X_n \le x) \\
&= P(X_n \le x, X \le x + \epsilon) + P(X_n \le x, X > x + \epsilon) \\
&\le P(X \le x + \epsilon) + P(|X - X_n| > \epsilon) \\
&= F(x + \epsilon) + P(|X - X_n| > \epsilon).
\end{aligned}$$

Similarly we have

$$\begin{aligned}
F(x - \epsilon) &= P(X \le x - \epsilon) \\
&= P(X \le x - \epsilon, X_n \le x) + P(X \le x - \epsilon, X_n > x) \\
&\le P(X_n \le x) + P(|X - X_n| > \epsilon) \\
&= F_n(x) + P(|X - X_n| > \epsilon).
\end{aligned}$$

When we combine these two estimates, we obtain

$$F(x - \epsilon) - P(|X - X_n| > \epsilon) \le F_n(x) \le F(x + \epsilon) + P(|X - X_n| > \epsilon).$$

Now let $n \to \infty$. We then obtain

$$F(x - \epsilon) \le \liminf_{n \to \infty} F_n(x) \le \limsup_{n \to \infty} F_n(x) \le F(x + \epsilon).$$

Now assume that $x$ is a continuity point of $F$. In that case, sending $\epsilon \to 0$ leads
to $F_n(x) \to F(x)$. Hence we have shown pointwise convergence of $F_n$ to $F$ in
continuity points of $F$, and the proof is complete. □
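The two-sided estimate in this proof rests only on inclusions between events, so it also holds when probabilities are replaced by counts in a simulated sample. The following sketch illustrates this under assumptions that are not in the text: $X$ is standard normal, $X_n = X + Z/n$ with $Z$ an independent standard normal, and the point $x = 0.3$, the value $\epsilon = 0.05$, the seed and the name `sandwich_counts` are arbitrary.

```python
import random

def sandwich_counts(xs, xns, x, eps):
    """Counts behind the empirical versions of F(x - eps), F_n(x), F(x + eps)
    and P(|X - X_n| > eps), for paired samples xs (of X) and xns (of X_n)."""
    lo = sum(a <= x - eps for a in xs)
    fn = sum(b <= x for b in xns)
    hi = sum(a <= x + eps for a in xs)
    far = sum(abs(a - b) > eps for a, b in zip(xs, xns))
    return lo, fn, hi, far

rng = random.Random(1)                 # fixed seed, chosen arbitrarily
size = 20000
xs = [rng.gauss(0, 1) for _ in range(size)]
noise = [rng.gauss(0, 1) for _ in range(size)]

results = {}
for n in (1, 10, 100):
    xns = [a + z / n for a, z in zip(xs, noise)]    # X_n = X + Z/n
    results[n] = sandwich_counts(xs, xns, x=0.3, eps=0.05)
    lo, fn, hi, far = results[n]
    # Combined estimate from the proof, written in terms of counts.
    print(n, lo - far <= fn <= hi + far, far / size)
```

As $n$ grows, the fraction of sample points with $|X - X_n| > \epsilon$ drops to (nearly) zero, and the empirical value of $F_n(x)$ is squeezed between the empirical values of $F(x - \epsilon)$ and $F(x + \epsilon)$.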
As a consequence of this result, Theorem 8.4.1 follows from Theorem 4.1.4 if
all random variables are defined on the same sample space. If they are not defined
on the same sample space, then it does not make sense to talk about $|X - X_n|$, and
Theorem 4.1.4 does not apply. However, Theorem 8.4.1 does apply in this case,
and it is in this sense that the last result is, strictly speaking, not a consequence
of the first.



8.5 The Central Limit Theorem 

I was once told that the central limit theorem is the most quoted result in
mathematics, not only in probability. Whether this is true or not I do not pretend to
know, but it is certainly true that the central limit theorem does play a crucial
role in probability.

In this introductory text, we will only state and prove the most basic central 
limit theorem. This result expresses the idea that when you take the sum of many 
independent random variables with the same distribution, this sum approximately 
has a normal distribution, regardless of the distribution of the summands! This 
is quite a startling fact: one can take Poisson random variables, add them up, 
and the result will be roughly normally distributed. If one starts with exponential 
random variables, the conclusion will be the same. Here is the formal result. 

Theorem 8.5.1. (Central limit theorem) Let $X_1, X_2, \ldots$ be independent random
variables with the same distribution, and suppose that their common expectation
$\mu$ and variance $\sigma^2$ are both finite. Let $S_n = X_1 + \cdots + X_n$. Then

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \Rightarrow N,$$

where $N$ denotes a random variable with a standard normal distribution.

Proof. The idea of the proof is the same as the proof of Theorem 8.4.1. Writing

$$Y_i = \frac{X_i - \mu}{\sigma},$$

we see that $E(Y_i) = 0$ and $E(Y_i^2) = 1$. We write $\phi$ for the characteristic function
of the $Y_i$'s. It follows from Theorem 8.3.2 that

$$\phi(t) = 1 - \frac{t^2}{2} + \beta(t),$$

where $\beta(t)/t^2 \to 0$ as $t \to 0$. Since

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_i, \qquad (8.6)$$

we see from Theorem 8.2.3 that the characteristic function of (8.6) is given by

$$\phi\left(\frac{t}{\sqrt{n}}\right)^n = \left(1 - \frac{t^2}{2n} + \beta\left(\frac{t}{\sqrt{n}}\right)\right)^n.$$

The proof is now finished in exactly the same way as the proof of Theorem 8.4.1. □



Exercise 8.5.2. Provide the missing details in this proof.
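For one simple choice of the $Y_i$'s, the limit in this proof can be checked on a computer. The sketch below is an illustration under an assumption that is not in the text: the $Y_i$ take the values $+1$ and $-1$ with probability 1/2 each, so that $E(Y_i) = 0$, $E(Y_i^2) = 1$ and the characteristic function is $\cos t$. The function name and the value $t = 1.5$ are arbitrary.

```python
import math

def cf_standardised_sum(t, n):
    """Characteristic function of (Y_1 + ... + Y_n)/sqrt(n) for fair +/-1 signs,
    which equals cos(t/sqrt(n)) to the power n."""
    return math.cos(t / math.sqrt(n)) ** n

t = 1.5                                   # arbitrary test point
limit = math.exp(-t * t / 2)              # characteristic function of N(0, 1) at t

# Distance to the limit for growing n.
errors = [abs(cf_standardised_sum(t, n) - limit) for n in (10, 100, 1000, 10000)]
print(errors)
```

The distances decrease towards 0, which is what the Continuity Theorem needs in order to conclude convergence in distribution to a standard normal random variable.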

The following example is a typical statistical application of the central limit 
theorem. 



Example 8.5.3. It is conjectured that one can see from a person’s handwriting, 
whether this person is a man or a woman. Suppose that someone inspects 1500 
examples of handwriting and that he or she assigns the right sex to the handwriting 
950 times. Is the conjecture plausible and why? 

To say something about this, imagine that there would be no difference in 
general between the handwriting of men and women. In that case, it is only rea- 
sonable that out of 1500 trials, the number of correct guesses would be roughly 
750. So in that sense, 950 correct guesses seems to be evidence for the idea that 
there is, in fact, a difference. A standard way to quantify this, is to ask the ques- 
tion how likely it would be, under the condition that there is no difference, to 
make at least 950 correct guesses. (Clearly the probability to make exactly 950 
correct guesses would be very small.) If we assume that there is no difference,
the number of correct guesses can perhaps be written as $X = \sum_{i=1}^{1500} X_i$, where
$P(X_i = 1) = P(X_i = 0) = \frac12$. Hence this number of correct guesses is a sum of
1500 independent random variables with the same distribution, and the central
limit theorem applies. The expectation of the $X_i$'s is $\frac12$ and the variance is $\frac14$.
Hence

$$\frac{X - 750}{\frac12\sqrt{1500}}$$

is (approximately) distributed as a standard normal random variable. We may
now write

$$P(X \ge 950) = P\left(\frac{X - 750}{\frac12\sqrt{1500}} \ge \frac{950 - 750}{\frac12\sqrt{1500}}\right) \approx P(N \ge 10.32),$$

where $N$ is a standard normal random variable. This last probability can be
estimated numerically with a computer or a table, and this number turns out to
be almost 0. The conclusion is then the following: if there is no difference in the 
handwriting, then the probability to guess correctly at least 950 times is about 0. 
Therefore, it seems safe to conclude that there is, after all, a difference in hand- 
writing between men and women. □ 
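The numbers in Example 8.5.3 can be reproduced with a few lines of code. This is a minimal sketch: the helper `normal_upper_tail` is a name introduced here, and the exact binomial tail at the end is an extra comparison that does not appear in the example itself.

```python
import math

def normal_upper_tail(z):
    """P(N >= z) for a standard normal N, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, p, observed = 1500, 0.5, 950
mean = n * p                              # 750
sd = math.sqrt(n * p * (1 - p))           # (1/2) * sqrt(1500)
z = (observed - mean) / sd                # standardised number of correct guesses
tail = normal_upper_tail(z)               # normal approximation of P(X >= 950)

# Exact probability under the 'no difference' assumption, using integer arithmetic.
exact = sum(math.comb(n, k) for k in range(observed, n + 1)) / 2 ** n

print(z, tail, exact)
```

Both the normal approximation and the exact binomial probability are positive but astronomically small, which supports the conclusion drawn in the example. So far out in the tail the two values need not agree closely in relative terms; only their order of magnitude matters here.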
Exercise 8.6.1. Let $X_1, X_2, \ldots$ be independent and uniformly distributed on (0, 1).
Let $M_n$ be the maximum of $X_1, \ldots, X_n$. Finally, let

$$Y_n = n(1 - M_n).$$

Show, without using characteristic functions, that $Y_n \Rightarrow X$, where $X$ has an
exponential distribution with parameter 1.

Exercise 8.6.2. Show that the sum of n independent Poisson distributed random 
variables with parameter 1, has a Poisson distribution with parameter n. 

Exercise 8.6.3. Use the previous exercise and the central limit theorem to show
that if $X$ has a Poisson distribution with parameter $n$, then

$$P(X \le n) \to \frac12,$$

and use this to show that

$$\lim_{n \to \infty} e^{-n} \left(1 + n + \frac{n^2}{2!} + \cdots + \frac{n^n}{n!}\right) = \frac12.$$



Exercise 8.6.4. Let $X_1, X_2, \ldots$ be independent Poisson distributed random
variables with parameter [...]. Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ be their average. Compute

$$\lim_{n \to \infty} P(\bar{X}_n < 1),$$

using the law of large numbers.

Exercise 8.6.5. Consider 1000 light bulbs, whose lifetime can be modelled with 
an exponential distribution with expectation five days. Approximate, using the 
central limit theorem, the probability that the total lifetime of the light bulbs 
exceeds 5200 days. 

Exercise 8.6.6. Suppose that $X_n$ has a geometric distribution with parameter $1/n$.
Show that

$$\frac{X_n}{n} \Rightarrow Y$$

as $n \to \infty$, where $Y$ is a random variable with an exponential distribution with
parameter 1.

Exercise 8.6.7. Suppose we throw a coin 10,000 times, and 5,273 heads come up. 
Discuss the question whether or not the coin is fair, using the central limit theorem. 

Exercise 8.6.8. A random variable X is called symmetric if X and —X have the 
same distribution. 

(a) Give an example of a symmetric random variable. 

(b) Show that a random variable is symmetric if and only if its characteristic 
function is real. 




Chapter 9 



Extending the Probabilities 



In this chapter we discuss how to extend the collection of events in rather general 
situations. This small chapter is the bridge between probability without measure 
theory and probability with measure theory. 



9.1 General Probability Measures 

In Section 6.1 and Section 6.2 we extended the probabilities in a rather ad hoc way. 
It turns out that there is a more general procedure, which assigns probabilities 
at once to classes of sets that are so extensive that most of its members never 
actually arise in probability theory. This procedure requires some knowledge of 
measure theory. The purpose of the current section is to describe the ideas involved, 
without going into measure theory itself. This section is supposed to form the 
bridge between the first course in probability as set out in this book, and a future 
course in probability theory based on measure theory. 

What does a general theory of probabilities look like? The principal setup, 
with a sample space of possible outcomes, does not change. This sample space 
can be finite, countably infinite or uncountably infinite. In the general theory, we 
do not restrict ourselves to or the sample space can be more exotic, for 
instance a space of functions with a certain property. The sample space is usually
denoted by $\Omega$, as in this book.

We have seen in this book that it is in general impossible to assign a well-defined
probability to all subsets of $\Omega$. We avoided this problem by restricting our
attention to special classes of subsets, called events. But the route we followed, 
although very useful in practice and for a course on this level, is not enough for 
general probability theory. 







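In the general context, one proceeds as follows. First, one identifies a collection
$\mathcal{F}$ of subsets of $\Omega$ with three basic properties:

1. $\Omega \in \mathcal{F}$;

2. $A \in \mathcal{F}$ implies $A^c \in \mathcal{F}$;

3. $A, B \in \mathcal{F}$ implies $A \cup B \in \mathcal{F}$.

Exercise 9.1.1. Show that these properties imply that (4) $\emptyset \in \mathcal{F}$ and that (5)
$A, B \in \mathcal{F}$ implies $A \cap B \in \mathcal{F}$.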

Any collection $\mathcal{F}$ which satisfies (1)-(3) is called an algebra or a field. The
idea behind this definition is that the collection of subsets which will receive a
probability should at least satisfy certain basic requirements: if we can speak
about the probability of an event $A$, then it is only natural to require that also the
probability of $A^c$ be defined. And if we can speak about events $A$ and $B$, then it is
only natural to require that also the union $A \cup B$, corresponding to the occurrence
of $A$ or $B$, is an event, and similarly for the intersection. Here are two examples.

Example 9.1.2. The collection of sets which can be written as a finite, disjoint
union of intervals in (0, 1) forms an algebra. □

Example 9.1.3. Let $\Omega = \{0, 1\}^{\mathbb{N}}$, the set of infinite sequences of 0s and 1s. A
cylinder set is a subset of $\Omega$ of the form

$$\{\omega \in \Omega : \omega_{i_1} = k_1, \ldots, \omega_{i_n} = k_n\},$$

for $i_1, \ldots, i_n \in \mathbb{N}$ and $k_1, \ldots, k_n \in \{0, 1\}$. The collection $\mathcal{F}$ defined as finite unions
of cylinder sets, together with the empty set, is an algebra. □
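On a finite sample space, the three defining properties of an algebra can be checked mechanically. The following sketch is only an illustration: the function `is_algebra` and the small collections `good` and `bad` are invented here and do not correspond to the two examples above, which live on infinite sample spaces.

```python
from itertools import combinations

def is_algebra(omega, collection):
    """Check properties (1)-(3) for a collection of subsets of a finite set omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    if omega not in sets:                                    # property (1)
        return False
    if any((omega - a) not in sets for a in sets):          # property (2)
        return False
    # property (3): the union of any two members is again a member
    return all((a | b) in sets for a, b in combinations(sets, 2))

omega = {1, 2, 3, 4}
good = [set(), {1, 2}, {3, 4}, omega]
bad = [set(), {1}, {2}, omega]      # the complements of {1} and {2} are missing

print(is_algebra(omega, good), is_algebra(omega, bad))
```

The check is quadratic in the number of sets in the collection, which is fine for toy examples like these.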

The next stage in the development of a general theory is to define a probability
measure $P$ on an algebra $\mathcal{F}$. This is a function $P : \mathcal{F} \to [0, 1]$ with the following
properties:

1. $P(\emptyset) = 0$, $P(\Omega) = 1$;

2. if $A_1, A_2, \ldots$ is a disjoint sequence of sets in $\mathcal{F}$ and if $\bigcup_{k=1}^\infty A_k \in \mathcal{F}$, then

$$P\left(\bigcup_{k=1}^\infty A_k\right) = \sum_{k=1}^\infty P(A_k).$$

This last property is called countable additivity of $P$. By taking
$A_{n+1} = A_{n+2} = \cdots = \emptyset$, countable additivity implies finite additivity:

$$P\left(\bigcup_{k=1}^n A_k\right) = \sum_{k=1}^n P(A_k).$$

The point of this part of the setup is that very often it is not too difficult to define 
a reasonable probability measure on an algebra. We again illustrate this with some 
examples. 
Example 9.1.4. Consider the algebra from Example 9.1.2. If we let, for $A \in \mathcal{F}$,
$P(A)$ be the sum of the lengths of the intervals that make up $A$, then one can
show that $P$ is a probability measure. □

Exercise 9.1.5. Verify this claim.
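The length measure of Example 9.1.4 is simple enough to compute with directly. The sketch below represents a set as a list of disjoint intervals with exact rational endpoints. Two simplifications are assumptions of this illustration and not part of the example: endpoints are ignored (single points have length 0), and the names `length_measure` and `complement` are made up here.

```python
from fractions import Fraction

def length_measure(intervals):
    """P(A) for A given as a list of disjoint intervals (a, b) inside (0, 1)."""
    ordered = sorted(intervals)
    for (a, b), (c, d) in zip(ordered, ordered[1:]):
        if b > c:
            raise ValueError("intervals must be disjoint")
    return sum((b - a for a, b in ordered), Fraction(0))

def complement(intervals):
    """Complement in (0, 1), again as a list of disjoint intervals."""
    gaps, left = [], Fraction(0)
    for a, b in sorted(intervals):
        if a > left:
            gaps.append((left, a))
        left = b
    if left < 1:
        gaps.append((left, Fraction(1)))
    return gaps

F = Fraction
A = [(F(0), F(1, 4)), (F(1, 2), F(3, 4))]
B = [(F(1, 4), F(1, 2))]                 # disjoint from A

print(length_measure(A), length_measure(complement(A)), length_measure(A + B))
```

With this representation one can check particular instances of the rules $P(A^c) = 1 - P(A)$ and $P(A \cup B) = P(A) + P(B)$ for disjoint $A$ and $B$. Countable additivity, the real content of Exercise 9.1.5, cannot be verified this way.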

Example 9.1.6. In Example 9.1.3, we can for instance define a probability measure 
on the algebra by putting 



Exercise 9.1.7. Show that this is a probability measure.

So far, the theory is not difficult. We have defined, in an abstract setting, an 
algebra and a probability measure on such an algebra. Often however, an algebra 
does not contain ‘enough’ events. Therefore, we want to extend the probability 
measure to a much larger class of subsets. This is the final and most difficult 
step in the general theory. In order to describe this last step, we need one more 
definition. 

Definition 9.1.8. A collection of sets $\mathcal{F}$ is called a $\sigma$-algebra if it is an algebra, and
if it is also closed under the formation of countable unions:

$$A_1, A_2, \ldots \in \mathcal{F} \text{ implies } \bigcup_{k=1}^\infty A_k \in \mathcal{F}.$$

A probability measure on a $\sigma$-algebra is defined as on an algebra, the only difference
being that in (2), we do not need to require that the countable union is in $\mathcal{F}$, as
this is automatically the case in a $\sigma$-algebra.

Given an algebra $\mathcal{F}$, we define the $\sigma$-algebra generated by $\mathcal{F}$ as the smallest
$\sigma$-algebra that contains $\mathcal{F}$. This generated $\sigma$-algebra contains all sets that can
be obtained from sets in $\mathcal{F}$ by taking countably many unions, intersections and
complements.

Exercise 9.1.9. Can you prove this last statement?
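On a finite sample space every algebra is automatically a $\sigma$-algebra, so the idea of a generated $\sigma$-algebra can be explored by brute force. The sketch below is an illustration for that finite case only, and the function name `generated_algebra` and the chosen generators are arbitrary. It repeatedly adds complements and pairwise unions until nothing new appears.

```python
def generated_algebra(omega, generators):
    """Smallest collection of subsets of the finite set omega that contains the
    generators and is closed under complements and pairwise unions."""
    omega = frozenset(omega)
    sets = {omega, frozenset()} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(sets)            # snapshot; new sets are handled next round
        for a in current:
            for candidate in [omega - a] + [a | b for b in current]:
                if candidate not in sets:
                    sets.add(candidate)
                    changed = True
    return sets

omega = range(1, 7)
sigma = generated_algebra(omega, [{1, 2}, {2, 3}])
print(len(sigma))
print(sorted(sorted(s) for s in sigma))
```

Here the generators split the sample space into the four blocks {1}, {2}, {3} and {4, 5, 6}, and the generated collection consists of all unions of these blocks. On an infinite sample space no such exhaustive enumeration is available, and this is exactly where the general theory needs more machinery.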

The last step of the general theory may now be compactly stated in the 
following way: 

Theorem 9.1.10. (Extension theorem) A probability measure on an algebra has a
unique extension to the generated $\sigma$-algebra.

This theorem contains two statements: we can extend the probability measure 
to a much larger (in general, at least) class, and we can do this in only one way, 
i.e., there is uniqueness. The proof of this theorem is quite lengthy and difficult, 
and it would not be appropriate to include it in this book. Note however that this 
construction immediately assigns a well defined probability to all sets that can 
be expressed as countable unions, intersections and complements of sets in the
algebra one starts with. 

So, for instance, if we want to make a satisfactory model for infinitely many 
coin flips, then we start with the algebra from Example 9.1.3, and define the 
‘obvious’ probability measure on this algebra. The extension theorem then tells us 
that this uniquely defines a probability measure on the $\sigma$-algebra generated by this
algebra, and this $\sigma$-algebra contains all sets that could ever be of interest to us.
If, on the other hand, we want a satisfactory model for choosing a random point 
in (0, 1), then the algebra in Example 9.1.2 is appropriate as a starting point. 
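To get a feeling for the coin flip model, one can compute with cylinder sets directly. The sketch below assumes that the 'obvious' probability measure is the fair-coin one, which gives probability $2^{-n}$ to a cylinder set fixing $n$ distinct coordinates. That formula, and the names `cylinder_probability` and `refine`, are assumptions made for this illustration. A cylinder set is stored as a dictionary from coordinate index to prescribed outcome.

```python
from fractions import Fraction
from itertools import product

def cylinder_probability(constraints):
    """Probability of {omega : omega_i = k for every (i, k) in constraints},
    under the fair-coin assumption: 2 ** (-number of fixed coordinates)."""
    return Fraction(1, 2 ** len(constraints))

def refine(constraints, extra_indices):
    """Split a cylinder set into disjoint cylinder sets that also fix the
    coordinates in extra_indices (those not fixed already)."""
    extra = [i for i in extra_indices if i not in constraints]
    return [{**constraints, **dict(zip(extra, bits))}
            for bits in product((0, 1), repeat=len(extra))]

first_is_head = {1: 1}                       # the first flip gives 1
pieces = refine(first_is_head, [2, 5])       # also fix flips 2 and 5
total = sum(cylinder_probability(c) for c in pieces)

print(len(pieces), total)
```

The four pieces are disjoint, their union is the original cylinder set, and their probabilities add up to the probability of that set. This is finite additivity at work on the algebra, and the extension theorem then takes care of the much larger generated $\sigma$-algebra.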

Hopefully, this short description provides enough motivation to learn more 
about measure theory and probability theory based on it. 
Appendix A

Interpreting Probabilities



In this appendix, we would like to pay some attention to the practical use and 
interpretation of the theory that we have discussed in this book. Probability theory 
is one of the most frequently used branches of mathematics, and perhaps also one 
of the most abused. This is not the place to discuss at length the philosophical 
issues which arise from using an abstract theory in practice, but it is worthwhile, 
I think, to make a number of points. 

The very first point to realise is that what we have done in this book is to
set up a mathematical model which hopefully gives a reasonable description of
some aspects of daily life. For instance, we all know that when we throw a coin
repeatedly, the relative frequency of heads converges to 1/2. In that light, it is
reassuring that in our model we can prove laws of large numbers that agree with 
this empirical fact. It is, of course, not the case that we have proved anything about 
daily life. An empirical law is not part of mathematics, and there is nothing to 
prove there. All we can conclude from the fact that experience agrees with theory, 
is that the model is a good model, as far as this particular aspect is concerned. 

When we set up a model, the role of this model is to describe a certain aspect 
of reality. If it turns out that the model behaves satisfactorily (as is clearly the 
case with repeated coin flips), then we might even turn to our model to make 
predictions. This is for instance what casinos do. Casinos know that the laws of 
large numbers work very well in real life, so by making their games slightly un- 
favourable for their customers, casinos are bound to make a lot of profit. Similarly, 
airline companies purposely overbook their flights by a certain percentage. They 
simply know, according to the law of large numbers again, that they will have to 
reimburse a certain number of people because of this, but overbooking makes sure 
that the aircraft will be full. A computation then easily shows which overbooking
percentage is optimal. Insurance companies also use the law of large numbers to
calculate risks. 
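To give an impression of what such an overbooking computation could look like, here is a small sketch. Everything in it is hypothetical: the binomial model for the number of passengers that show up, the number of seats, the show-up probability, the fare, the compensation and the function name `expected_profit` are invented for the illustration and are not taken from any airline.

```python
from math import comb

def expected_profit(tickets, seats, p_show, fare, compensation):
    """Expected profit when `tickets` tickets are sold and every passenger shows
    up independently with probability p_show (a binomial model). Every ticket
    brings in `fare`; every passenger without a seat costs `compensation`."""
    profit = 0.0
    for k in range(tickets + 1):
        prob = comb(tickets, k) * p_show ** k * (1 - p_show) ** (tickets - k)
        bumped = max(0, k - seats)
        profit += prob * (fare * tickets - compensation * bumped)
    return profit

# Hypothetical numbers, chosen only for the sake of the example.
seats, p_show, fare, compensation = 100, 0.9, 200.0, 600.0

# Try selling between 100 and 130 tickets and keep the most profitable choice.
best = max(range(seats, seats + 31),
           key=lambda m: expected_profit(m, seats, p_show, fare, compensation))
print(best, expected_profit(best, seats, p_show, fare, compensation))
```

In this toy model it pays to sell more tickets than there are seats. Whether such a model says anything about a real airline is precisely the kind of question discussed in this appendix: that can only be judged by experience with the model.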
In all these cases (you can easily come up with other examples, we guess)
it turns out that probability theory is well capable of describing and predicting. 
Why is this so? Why is probability theory so incredibly useful? 

When we construct a mathematical model involving randomness, this is very 
often done because we do not have full information about the process to be mod- 
elled, and we interpret the unknowns as being random. In other words, the things 
that we can not be sure of, are modelled as being random. The number of cus- 
tomers in a shop during an hour is something we cannot foresee, and therefore we 
model this by a random variable. It has been a long term philosophical question 
whether or not this number is truly random, or only perceived as being random 
by us, humans. 

I do not think that this is a very meaningful question, for the simple reason 
that I would have no idea how to define the word ‘random’ outside a mathematical 
context. Whenever we think of processes as being random, like winning a lottery, 
throwing a 6 with a die, or meeting someone on the street, as soon as we associate 
this apparent randomness with probabilities, we are, in fact, already working and 
thinking in a mathematical model. For instance, when you claim that the proba- 
bility of a certain rare event is no more than 1 out of 10,000, then in fact you talk 
about a model which is supposed to describe this event. This can easily lead to 
misunderstanding and abuse of probability theory. I do not think it is appropriate 
to dismiss very remarkable events by saying that probability theory tells you that 
rare events will occur eventually, no matter how small their probability. Saying 
this is turning things upside down: When you want to say something about such 
a very rare and remarkable event, you first make a model, in which this event has 
nonzero, but small, probability. If the event then occurs, you might be tempted to 
appeal to the model and say: ‘well, it is not so remarkable that the event occurs, 
because my model assigns positive probability to it’. But this is circular reasoning,
since you have designed the model yourself, including the nonzero probability.
Hence I would rather not dive into troubled water, and prefer not to make any 
statement about randomness in real life, whatever that may be, and no matter 
how one should or would try to define it. 

This means that I do not think that the role of probability theory is to model 
random events. As I have tried to explain, this would be a meaningless statement, 
given the fact that I do not know what the word ‘random’ means. Instead, I prefer 
to see probability theory as a way to deal with complex phenomena which would 
have been hard or even impossible to describe otherwise. The use of randomness in 
a mathematical model is a conscious choice of the person who designs the model, 
and has, in principle, nothing to do with the driving forces that are behind the 
process described by the model. 

I think that this is a very important point. One can ask when this is appropri- 
ate, I mean when it is appropriate to use randomness in modelling a phenomenon. 
I think there is no general, clear-cut answer to this question, but I have a rather 
pragmatic view: I consider randomness in modelling appropriate when experience 
with the model shows that the model works. Our experience with modelling with 
randomness shows that it works very often and very well. But this point of view
implies that I am quite critical towards a fair number of examples where proba- 
bility theory is used to draw conclusions. For instance, when someone claims that 
the probability of life elsewhere in the universe is large, because there are so many 
planets on which life could possibly emerge, then I do not think that this is a
statement that can be taken seriously. The reason is simple: there is no way to 
check whether or not the model used is, in fact, describing things well. We cannot 
check whether the model makes sense. I have similar problems with probabilistic 
statements about the possibility of a big flood in the coming 100 years or so. How 
can I convince myself that the model used here is any good? 

So stay alert when people make probabilistic statements about future devel- 
opments. Probability theory is a very useful tool, but should be used with care 
and attention. 




Appendix B 

Further Reading 



After the introduction in this book, there are a number of ways to proceed. If 
you are interested in the foundations of probability theory in terms of measure 
theory, then Billingsley’s excellent Probability and measure (3rd edition Wiley 
1995) is a very good idea. In this monumental book, all necessary measure theory 
is developed along the way, motivated by probabilistic questions. Another good 
possibility in this direction is Foundations of modern probability by Kallenberg 
(Springer 2002). Both of these choices are quite demanding.

Breiman’s classic Probability (Addison-Wesley 1968) is still a very good choice
to continue reading, and a little easier to digest. For a very original approach you 
can try Probability with martingales by Williams (Cambridge University Press 
1992) which also includes the necessary details of measure theory. For a book of 
roughly the level of this book, consult Probability and random processes by Grim- 
mett and Stirzaker (Oxford 1993). Feller’s classics An introduction to probabil- 
ity theory and its applications I and II (Wiley 1978) are still very much worth 
the effort. More recent is A user’s guide to measure theoretical probability by
Pollard (Cambridge University Press 2001), also aiming at an audience without
measure-theoretical background, but making a different choice by providing
measure-theoretical details (without proofs) along the way. Durrett’s Probability: 
theory and examples (Duxbury Press 1995) is a somewhat demanding but reward- 
ing book, containing a lot of interesting examples illustrating the theory. Perhaps 
you will find Gut’s An intermediate course in probability (Springer 1995) useful as 
a follow-up on the current book. 

For a very pleasant introduction to some simple but beautiful stochastic 
processes, we recommend Markov chains by Norris (Cambridge University Press 
1997). For generalisations of the Poisson process in the framework of so-called 
renewal processes, it is a good idea to read Karlin and Taylor A first course in 
stochastic processes (Academic Press New York 1975). There even is a very nice 
little book solely devoted to Poisson processes in arbitrary spaces, Kingman’s 
Poisson processes (Oxford 1993). For more details on branching processes, a good 
option is Branching processes by Athreya and Ney (Springer 1972). Many of the
above-mentioned books contain a large number of references to the literature. 




Appendix C 

Answers to Selected Exercises 



Chapter 1: 1.7.3 (a) 1/6, (b) 1/9, (c) 1/12; 1.7.4 4(^®)/(®2); 1,7.5 (a) 5/12, (b) 5/12, 

(c) 4/11; 1.7.6 (a) 29/32, (b) 9/16, (c) 81/256; 1.7.8 17/27; 1.7.9 ( 5^) ( 

1.7.11 1/n; 1.7.12 333/1000, 1/5, 9/1000; 1.7.14 1/16; 1.7.17 no; 1.7.18 (8/21)/(8/21 

+ 3/28); 1.7.20 © ( 2®) ( 4°)~^ 1-7.25 1/3; 1.7.29 1/3, 1/2; 1.7.31 (a) 1/3, (b) 1/5. 

Chapter 2: 2.7.3 1 - 5e~^; 2.7.7 (b) F(X = kjD) = 1/3 for A; = 6, 10, 12; 2.7.10 (a) 
e~\ (b) l/2 + l/2e-2, (c) 1; 2.7.13 r/p; 2.7.14 (a) 18/38+20/38(18/38)2 = 0.5917; 
2.7.16 (a) 3/5, (b) 2/3, (c) $E(X \mid N = n) = n/2$, $E(X) = 1/3$; 2.7.17 (c) no, (d) 6,5;
2.7.18 $E(X \mid Y = 4) = 5$, $E(Y \mid X = 2) = 3$; 2.7.24 (b) 7/30; 2.7.26 the marginals
are binomial with parameters n and p*; 2.7.29 fc/2; 2.7.37 (a) Z = 2X — 5, (b) 
E{X) = -5/3. 

Chapter 4: 4.3.1 (c) the random variables are not independent. 

Chapter 5: 5.12.2 (a) t = \/2, (b) 4y/2/5, (c) 3/4; 5.12.3 a and b satisfy a+ = 1 
and ^a+ \b = f ; 5.12.4 (a) 3, (b) 3/4, (c) 7/8, (d) 3/80; 5.12.5 (a) 1/2, (b) 
7 t/ 2; 5.12.7 (a) 60, (b) 11/32; 5.12.12 (a) A/(A + p); 5.12.18 (a) /y(p) = ye~y 
for y > 0, (c) E{X\Y = y) = y/2, E{X) = 1; 5.12.19 (a) fx{x) = e~^ for 
2; > 0, fviy) = {y + 1)“^ for y > 0, (c) 2/{y + 1); 5.12.20 yes; 5.12.21 (c) 1/x; 
5.12.22 (a) f{x,y) = 2{x^ - 4t/)-i/ 2 for X € (0,2). y € (0,1), > 4p; 5.12.23 

fy{y) = 2-7r“^(l — 2/^)“^^^ for y € (0,1); 5.12.25 f{w,v) = — ^logw ■ for 

w,v e (0,1); 5.12.31 1/8; 5.12.33 (2A)-i; 5.12.34 f{z) = 2 for 0 < 2 < 1 and 
f{z) = 2 — 2 for 1 < 2 < 2; 5.12.36 fz{z) = e~^ for 2 > 0. 

Chapter 7: 7.5.1 (a) P{X < 10) where X has a Poisson distribution with parameter 
9, (b) e-\ (c) e-45; 7.5.2 P{N{s) = k\N{t) = n) = (f)''(l - f)"-''(fe), the 
binomial distribution with parameters n and s/t. 

Chapter 8: 8.6.4 1. 
$\sigma$-algebra, 179
generated, 179 

Abel’s theorem, 59 
algebra, 178 
arc-sine law, 72 

ballot theorem, 69 
Banach-Tarski paradox, 85
Bayes’ rule, 17, 24 
binary expansion, 84, 132 
binomial coefficient, 6 
branching process, 141 
extinction of, 143, 144 
bridge, 9 

Buffon’s needle, 129 

Cantelli’s inequality, 66 
Cantor set, 145 

Cauchy-Schwarz inequality, 46 
central limit theorem, 79, 173 
local, 82 

characteristic function, 164 
and independence, 165 
expansion of, 169 
of binomial distribution, 166 
of Cauchy distribution, 172 
of exponential distribution, 166 
of normal distribution, 170 
of Poisson distribution, 166 
Chebyshev inequality, 46, 124 
coin tossing, 8 
fair, 8 

general, 20, 39 
combinatorics, 6 



complement, 4 
conditional, 53, 121 
density, 121 
distribution, 53, 120 
distribution function, 121 
expectation, 54, 121 
probability, 12, 53 
continuity theorem, 167 
contour integral, 172 
convergence, 72 

in distribution, 72, 162 
weak, 161 
countable, 2 
countably infinite, 2 
counting, 6 
covariance, 46, 116 

of bivariate normal distribution, 
116 

darts, 88, 89 
de Méré’s paradox, 31
density, 90 
disjoint, 5 

pairwise, 5 
distribution, 38, 96 
binomial, 35 

bivariate normal, 106, 122, 127 
Cauchy, 98 

double-exponential, 126 
exponential, 98 
gamma, 111, 152
geometric, 35 
multinomial, 64 
negative binomial, 36 
normal, 97 
Poisson, 35

truncated exponential, 157 
uniform, 96 

distribution function, 34, 95 
joint, 50, 104 
properties of, 36 
DNA profile, 23 
dominated convergence, 73, 92 
drawing, 6 

with replacement, 6 
without replacement, 7 

envelope problem, 57, 102 
estimate, 159 
event, 3, 90 

expectation, 41, 59, 99, 112 
of a function, 42, 113 
of a mixture, 117 
of a sum, 43, 112 
of binomial distribution, 47 
of Cauchy distribution, 102 
of exponential distribution, 100 
of geometric distribution, 48 
of negative binomial distribution, 
62 

of normal distribution, 101 
of Poisson distribution, 48 
of uniform distribution, 101 
experiment, 3, 90 
exponential series, 22 
extension theorem, 179 

generating function, 58, 142 

of geometric distribution, 59 
geometric series, 48 

inclusion-exclusion, 11 
independence, 18, 39, 51, 60, 105, 133, 
147 

pairwise, 19 
indicator function, 52 
infinite repetitions, 131 
inter-arrival time, 151
intersection, 4 



inversion theorem, 166 
island problem, 23 

Jacobian, 108 

lack of memory, 62, 126 
law of large numbers, 26, 124, 170 
strong, 136-138, 158 
weak, 75, 76 
left to right crossing, 22 

marginal distribution, 50, 104 
Markov inequality, 46, 139 
master mind, 7 
mixture, 117 
model, 147 

networks, 21 
dual, 22 

numerical integration, 124 

partition, 15 
Poisson process, 152 

in higher dimensions, 160 
inhomogeneous, 160 
thinning of, 159 
predictions, 181 
probability density, 90 
probability mass function, 3, 34 
joint, 49 

probability measure, 3, 90, 178 

countable additivity of, 11, 178 
finite additivity of, 11 

random variable, 33, 94, 135 
continuous, 94 
discrete, 94 
functions of, 107 
symmetric, 175 
random vector, 49, 103 
continuous, 104 
random walk, 67, 141 
recurrence of, 141 
reflection principle, 69
regular, 110 
relative frequency, 1
roulette, 79 

sample space, 3, 90 
secretary’s problem, 65 
set, 6 

finite-dimensional, 134 
ordered, 6 
small, 134, 135 
unordered, 6 
Simpson’s paradox, 30 
size bias, 25 

St. Petersburg paradox, 41 
standard deviation, 45 
stationarity, 147 
statistical application, 159, 174 
Stirling’s formula, 72 
sum, 52 

of Poisson distributions, 53 
of two random variables, 52, 111 

uncountable, 84 
union, 4 

variance, 45, 59, 115 
of a sum, 45, 115 
of binomial distribution, 47 
of normal distribution, 115 
of Poisson distribution, 48 
of uniform distribution, 115 

waiting time, 91, 151 
waiting time paradox, 156 
well defined, 41 